FOREWORD



This Summary Report No. IITRI-C6156-18 entitled "Four Year
Summary--Educational and Commercial Utilization of a Chemical
Information Center" summarizes work carried out on IITRI Project 
C6156 for the period June 25, 1968 to June 25, 1972. The project 
was funded by the National Science Foundation under Contract 
NSF-C 554 and was monitored via the NSF Office of Science Informa- 
tion Service. 

The project leader throughout the four years was 
Martha E. Williams, the principal investigator until April 1971 
was Eugene Schwartz, and the programming coordinator throughout 
the time period was Peter B. Schipma. 

Contributions to this report were made by Martha E. Williams,
Peter B. Schipma, Scott E. Preece, David S. Becker,
Patricia A. Llewellen, and Alan K. Stewart.

We would like to acknowledge the significant contributions 
to the project made by Eugene S. Schwartz in the area of system 
design, by Barbara M. Louthan in the area of programming the 
logic evaluation, and by Elaine Onderisin in originating the 
Least Common Bigram search technique. The project has been 
carried out as a team effort and significant inputs to the sys- 
tem design, design of programming module functions, operational 
procedures, and user requirements were made by all of the pro- 
fessional staff. We would like to acknowledge the former staff 
members Barbara Boone, John North, Henry Saxe, and Allan Shafton 
whose efforts have contributed to the success of the program. 
Finally, we would like to thank Arline Finnegan whose efforts as 
the CSC technician in handling output and maintaining records 
have provided essential support to the Center. 
Martha E. Williams
Manager
Information Sciences
ABSTRACT



FOUR-YEAR SUMMARY 

EDUCATIONAL AND COMMERCIAL UTILIZATION 
OF A CHEMICAL INFORMATION CENTER 

The major objective of the IITRI Computer Search Center 
(CSC) is to educate and link industry, academia, and government 
institutions to chemical and other scientific information sys- 
tems and sources. The CSC was developed to meet this objective 
and is in full operation providing services to users from a 
variety of machine-readable data bases with minimal restrictions
and a high degree of flexibility. A new modular machine-inde-
pendent PL/1 software system was developed for handling virtually
any bibliographic-type data base. CSC's transferable programs
have run at fifteen different computer facilities with different
hardware, computer models, versions of OS, peripherals, and
releases of the PL/1 compiler. All data bases are converted by 
a preprocessor to a standard IITRI format which employs a 
directory and character string type of file structure and are 
searched by a software system that employs the novel IITRI- 
developed Least Common Bigram search screen technique. 

User oriented profile features include: full free form
Boolean logic with any degree of nesting; search terms may be
any data element on a data base; search terms may be single
words, multi-word terms, phrases, or term fragments; full
truncation capabilities; option for sorting output by author,
citation number, or weight; and options for printing output
on 5 x 8 cards, multilith masters, paper, magnetic tape, or
COM. User aids were developed for each data base to assist in
profile development and monitoring. They include: a Search
Manual, data base oriented supplements to the Search Manual,
Truncation Guides, term frequency lists, KLIC Indexes, and
Search Term Frequency/Issue lists for each profile.
Research is conducted and statistics maintained to con-
tinuously study, monitor, and improve Center components in- 
cluding data bases, profiles, systems, personnel functions, 
and user services. 

Education and training is provided through seminars, 
workshops in profile preparation, and a graduate course in 
"Modern Techniques in Chemical Information". The educational 
and marketing efforts familiarize users and potential users 
with the many advantages of computerized retrieval, which are 
the raison d'etre for the center, including: access to wide
coverage; thoroughness of search; consistency of search; inter-
disciplinariness of data bases; high recall; speed of search;
regularity of information dissemination; timeliness; automated 
personal file preparation and maintenance; and cost effectiveness. 
FOUR-YEAR SUMMARY



EDUCATIONAL AND COMMERCIAL UTILIZATION 
OF A CHEMICAL INFORMATION CENTER 

1. INTRODUCTION 



1.1 Report Description 

This report summarizes and organizes all of the signif- 
icant findings and information recorded in the previous IITRI 
reports C6156-1 through C6156-17. The Summary Report provides 
a comprehensive overview of the research performed in estab- 
lishing and operating the Computer Search Center (CSC) at 
IITRI, and provides an overall analysis of these data. The 
evolution of ideas and design parameters is also given in 
historical perspective. Thus this report can be used to 
trace the course of the project in lieu of piecing together 
the Quarterly Reports that detailed work in progress. 

The Summary Report is composed of fourteen major sec- 
tions that detail the research activities from conception 
and design through implementation and operation. The second 
part of this section, the INTRODUCTION, provides the history 
and background of the project. It presents the perspective 
from which to view the balance of the report. 

Section 2 covers the COMPUTER SEARCH CENTER DESIGN AND 
DEVELOPMENT. Our initial objectives, the initial system design
made to meet those objectives, and development of the design
are discussed. The bases for our original decisions on hard-
ware, programming language and program features for installa- 
tion independence are given. 

Section 3 is concerned with the SERVICES provided by 
the CSC. These include Selective Dissemination of Informa- 
tion (SDI) and retrospective searches, the Private Libraries 
System and software installation. Sections 4 and 5 elaborate 
on these services, with 4 covering PROFILE PREPARATION AND
MODIFICATION, including discussions of profile forms, formats 
and features, while 5 describes the SOFTWARE SYSTEM. Section 
5 contains both descriptive information on topics such as 
the IITRI file structure, data base format conversion, search 
strategy and logic evaluation, and definitive information on 
the program set, core requirements, and files. 

Sections 6, 7, 8, and 9 concern the relationships
among the users, the data bases, and CSC. Section 6 gives
DATA BASE CHARACTERISTICS AND COMPARISONS. Section 7
describes the USER AIDS we have developed: Search Manual
and Supplemental Guides, KLIC Indexes, Term Frequency lists,
and Truncation Guide. Section 8 covers
USER EVALUATION AND FEEDBACK. Section 9 covers EDUCATION -
USER LIAISON by discussing workshops, seminars, courses, and
the Workbook on Modern Techniques for Chemical Information.

Section 10 describes all the functions of CENTER MAN- 
AGEMENT AND PROCEDURES necessary to serve the users. It 
includes such topics as profile, data base, and user record
handling as well as internal statistics, marketing, and re- 
lationships with other centers. 

Section 11 covers the RESEARCH activities of the pro- 
gram. Many analytical studies of data bases and linguistic 
analyses of profile-citation interfaces were made in the course 
of this project. Some basic facts were discovered, such as 
finding that lexicographical ordering based on letters from 
left to right in a word is a poor ordering form upon which to 
base a search algorithm. The applications of this and other 
findings to text searching are discussed. 

The last three sections, 12, 13, and 14, present listings
of conferences, presentations, publications, and professional 
activities carried out in conjunction with the program; a list 
of REFERENCES and our SUMMARY AND CONCLUSIONS. 
1.2 History and Background

The proliferation of chemical literature over the past 
several decades has been a growing source of concern to both 
the professional scientist and his management. There are now 
over 300,000 papers per year referenced in Chemical Abstracts, 
and 250,000 per year in Biological Abstracts. Several years 
ago a government research executive was quoted as saying: 

"If the research program cost $100,000 or less, it is less 
expensive to do it again than to make sure it has not been 
done before!" This statement, fortunately, is no longer true. 
Many of the principal secondary sources--indexing and ab-
stracting journals as well as other collections of information--
are now available in machine-readable form, and information
centers have been established for searching these new data
bases to provide scientists and engineers with an inexpensive
means of coping with the scientific literature. Currently more
than two million scientific and technical papers are published
each year and even with the use of abstracting and indexing
journals, it is no longer feasible for the average scientist
to keep up in his own field if he must rely on manual searching.

Numerous solutions to the information explosion problem
have been posed, such as reducing the number of articles pub-
lished, publishing only summaries or abstracts of articles, or
retaining full documentation on magnetic tape only and an-
nouncing the existence of the information to persons in the
appropriate subject areas. The difficulty of implementing such
solutions in our "publish or perish" society, where publications
affect both salary and ego boosting, would seem to indicate
that printed publications, either as full articles or as shortened
versions, are here to stay. Hence, the machine-readable
versions of these or their surrogates will need to be searched
by information centers.

The cost of keyboarding or otherwise preparing large 
machine-readable files is high, and until recent years when 
the preparation of machine-readable records was done for pur- 
poses of computerized typesetting or to speed up publication, 
the cost of inputting information could seldom be justified on
the basis of information retrieval. Computer-readable files
are now being produced in significant numbers and even though,
in most cases, the file is created as a by-product of publi-
cation activity, the file does exist and can be searched.

A survey by the American Institute of Physics¹ iden-
tified 50 commercially available scientific and technical
data bases. The Directory of Computerized Information in
Science and Technology² has identified several hundred addi-
tional data bases--most of which are specialized and small.
There are currently perhaps 10-20 popular data bases and many
more that enjoy limited use. The Association of Scientific
Information Dissemination Centers (ASIDIC), Cooperative Data
Management Committee, recently published the ASIDIC Survey of
Information Center Services and found that the 56 responding
centers identified 48 publicly available data bases that
they are processing either for SDI (selective dissemination
of information) or retrospective searches.

In the late 1960's the Office of Science Information
Service (OSIS) of the National Science Foundation (NSF)
recognized the need for data base services and for research and
development regarding the data bases, data base services, and
operational aspects of centers that handle machine-readable
data bases. Accordingly, NSF provided seed money for several
"university based information centers". These are located at
the University of Pittsburgh, the University of Georgia, Lehigh
University, the University of California (UCLA) and at IIT
Research Institute (IITRI). Although IITRI is a not-for-
profit contract research organization and not a university,
it is affiliated with Illinois Institute of Technology (IIT).

The IITRI Computer Search Center (CSC) was established 
in 1968 and was designed as a one-stop information center to 
meet user needs by providing a variety of desired sources and 
services with minimal restrictions and a high degree of flexi- 
bility. Services include both current awareness (SDI) and 
retrospective searches tailored to a user's or organization's
needs. Users of the Center are scientists and engineers in 
industry, universities, and other research organizations. 

The SDI system has been operational since September 1969 
and CSC offers services from Chemical Abstracts' Condensates
(CA), Biological Abstracts' Previews (BA), and Engineering
Index's COMPENDEX (EI) on a production basis. CSC plans to
add the International Food Information Service data base in 
the fall of 1972. 



2. COMPUTER SEARCH CENTER DESIGN AND DEVELOPMENT 

2.1 Objectives 

Chemists generate, need, and use chemical information as 
indicated by the existence of a large number of primary chemical 
journals, by the size and growth rate of the secondary abstract- 
ing journals, and by the existence of chemical libraries in many 
commercial, educational, and government research and develop- 
ment installations. The more than 100,000 chemists in the Ameri- 
can Chemical Society spend a significant amount of time perusing 
the literature. It was noted in an article in the July 28, 1969 
issue of Chemical and Engineering News that the average amount 
of time spent by an industrial chemist on current awareness 
reading is 7.5 hours per week. 

ACCESS, the listing of journals by Chemical Abstracts 
Service (CAS), names more than 20,000 chemical or chemistry- 
related journals. This number does not take into account in- 
house publications or research and development reports pre- 
pared by industry, government, and government contractors both 
within and outside of the United States. 

Traditionally, the rehandling and distribution of technical 
information has been done by means of printed publications. 
Because the volume of scientific literature has grown so large, 
it has become necessary to employ automation and new techniques 
to make the information available to users within a reasonable 
time span. Much has been done and reported regarding computer
techniques for composition, storage, search, and retrieval of
chemical information. However, it is necessary to utilize the
newer techniques and sources and to train the users--the bench
chemists, who are familiar with the standard and traditional
sources and means of obtaining information--in the use of the
new technology.

There is a large volume of chemical information that now 
exists in machine-readable form and there are many chemists who 
are the potential users of this information. A potential market 
exists but there is a real problem in devising methods of bring- 
ing the users to the new information sources or disseminating 
the information from these sources to users. Information
scientists at IITRI are helping to solve this problem by the 
operation of the Computer Search Center. 

2.2 Design and Development 

The CSC system was designed to provide a variety of infor-
mation storage and retrieval-type services from a multiplicity
of existing and future data bases, with numerous profile options, 
flexible search strategies, variable sort options, and varying 
output media. This was to be done in a manner that would per- 
mit us to use one generalized software system that would be easy 
to modify and alter and would be machine independent and in- 
stallation independent. 

The general objectives led to the establishment of design 
requirements and the development of special features for the 
CSC system. Requirements included: program transferability;
machine independence and installation independence; ability to
handle numerous data bases; development of general purpose
programs; and modularity. Special features included: aggre-
gation of profile terms; left and right truncation of terms;
free-form Boolean logic; removal of redundant search terms;
options for sorting of output; options for media on which out-
put is printed; and designation of hit terms, index terms, and
weight on each output citation form.
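
Left and right truncation of terms can be illustrated with a short sketch. It is written in Python for brevity, not in the PL/1 of the CSC programs. The asterisk used to mark truncation and the function name term_matches are assumptions made for this illustration only and are not taken from the CSC Search Manual.

```python
def term_matches(term: str, word: str) -> bool:
    """Compare one profile term with one word, honoring '*' truncation marks.

    A leading '*' (left truncation) lets any characters precede the stem;
    a trailing '*' (right truncation) lets any characters follow it.
    The '*' notation is an assumption of this sketch.
    """
    left = term.startswith("*")
    right = term.endswith("*")
    stem = term.strip("*").upper()
    word = word.upper()
    if left and right:
        return stem in word            # stem may occur anywhere in the word
    if left:
        return word.endswith(stem)     # stem must end the word
    if right:
        return word.startswith(stem)   # stem must begin the word
    return word == stem                # untruncated term: exact match only


if __name__ == "__main__":
    for term, word in [("POLYMER*", "polymerization"),
                       ("*MYCIN", "Streptomycin"),
                       ("*CHLOR*", "dichloromethane"),
                       ("POLYMER", "polymers")]:
        print(term, word, term_matches(term, word))
```

With this scheme a single truncated term stands in for a whole family of word forms, which is why truncation reduces the number of terms a user must list in a profile.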

Because none of the computer search programs available at 
the time met all of the criteria required by the Center, and 
because of the need to handle a variety of data bases, new 
general purpose computer programs were written. The compiler 
language PL/1 was employed to achieve machine and installation 
independence and hence a high degree of program transferability. 

CSC programs were initially written and debugged using the 
RUSH (Remote-User-Shared-Hardware) interactive programming 
system. Using a terminal at IITRI, programs were written, 
compiled, and debugged on a 360/50 in Palo Alto, California.
RUSH is a dialect of PL/1 and programs were developed avoiding
those features and statements in RUSH that were not currently
in PL/1. Once the programs were written and debugged on RUSH,
they were converted to PL/1 and run on several 360's in the 
Chicago area. The transition from RUSH to PL/1 went very 
smoothly. 

The programs were written in a modular fashion so that 
changes, additions, and deletions could be readily accommodated.
A separate block was written for each separate operation within
a program. The basic functions provided by the programs are 
source tape format conversion, profile preparation, search, 
output generation, and maintenance of statistics. The programs 
are described in detail in Section 5 of this report. 

The basic set of programs was written, tested, and put 
in production in September 1969. At that time a pilot group 
of users prepared 146 profiles for searches of CA Condensates. 
Subsequently, BA Previews and COMPENDEX were added to the pro- 
duction system. 

The number of users and profiles has varied from time to 
time as new users have been brought into the system and experi- 
mental profiles were tried. Users represented industry, 
academia, and government, with the majority being from industry. 

Throughout the course of the project and as production 
data accumulated we have made continuing efforts to update, 
streamline, and increase the effectiveness of our computer
programs. These efforts have been rewarded as is evidenced by 
a very great reduction in computer processing time required for 
the weekly production searches (see Section 10). 

In addition to the creation of an operational computer 
search, retrieval, and dissemination system, IITRI has instituted 
educational and training programs, the purpose being not only to 
develop a center, but to ensure its continuing use in the future. 
This objective led to the development of a Search Manual for 
profile preparation, the development of a workbook in Modern
Techniques in Chemical Information, the teaching of a new
academic course at Illinois Institute of Technology, and the 
presentation of seminars. A detailed discussion of the educa- 
tional aspects of the project is given in Section 9 of this 
report. 
2.3 Programming Language Selection

The CSC design criteria of software transferability, 
machine independence, and installation independence together 
with the desire to carry out coding tasks in a relatively 
short time period while generating modular, flexible general 
purpose programs led to the decision to develop software 
in a higher level compiler language rather than in a machine 
language or assembly language. 

We investigated several compiler languages such as 
FORTRAN, COBOL, ALGOL, and PL/1 and selected PL/1 because of 
its flexibility, generality, power, and modularity. The pro- 
gram goal of producing programs that could be transferred to 
other organizations ruled out the machine dependent machine 
level and assembler level languages. Of the higher level 
compiler languages, PL/1 appeared to offer the best balance 
of flexibility, generality, power, and modularity necessary 
for the other goals of generating a program set that could 
change as data bases changed, incorporate new data bases and 
contain the features desired by users. 

PL/1 is currently available on IBM 360 and 370 series 
hardware, which comprises the majority of computer installa- 
tions. Burroughs has announced a PL/1 compiler and one for 
the Digital Equipment Corporation PDP-10 is nearly ready. 
Univac, CDC, and others are preparing PL/1 compilers. Thus, 
PL/1 will shortly be quite machine independent. Even con- 
sidering the 360-70 family as a limitation of sorts, there 
is wide variation among the many models in this series. CSC 
programs have run on over 15 different configurations of 360's 
and 370's with no problems. If currently not machine indepen- 
dent, PL/1 assuredly has a high degree of configuration- 
independence. 

Many of PL/1's features are eminently suitable for text
processing. These include character and bit string handling 
functions, structure variables, hierarchical data structures 
and arrays, list processing capabilities, and device indepen-
dent I/O. By using these features, we have been able to
implement all of the system concepts we have evolved in 
the PL/1 language. In no case did the language limit our 
design options. This is due to the rich syntax of the 
language and speaks very highly for its flexibility. 

PL/1 is admittedly not as efficient as an assembler 
level language and there are usually many ways to do any 
operation--with varying degrees of efficiency. However, 
the use of modular programming techniques and the power of 
the language have overcome this lower execution efficiency. 
Since we were able to try out design modifications in very 
short amounts of time and without disrupting a production 
schedule, we were able to devote less time to programming 
and coding and more time to investigation of what really 
goes on in a bibliographic search system. This enabled us 
to test six different search techniques and to develop such 
concepts as that of the Least Common Bigram which more than 
offset the efficiency differences of PL/1 and assembler
level languages. Such multiple testing would not have been 
possible within reasonable constraints of time, dollars, 
and the realities of a production activity without a compiler 
level language such as PL/1. Modular programming techniques, 
easily implemented in PL/1, allowed us to make changes in 
portions of the set (to accommodate a new data base, for ex-
ample, or to react to a data base format change) with no
interruption to production and without changing all the pro- 
grams in the set. 

In addition, PL/1 is quickly learned and it is possible 
to familiarize new staff members with the overall programming 
system in a relatively short time. With a sophisticated set
of assembler language programs the termination of a staff mem-
ber is likely to be a more traumatic experience than is the
case with PL/1.
2.4 Computer

One of CSC's objectives was the development of programs
that could run at a variety of installations. Inasmuch as the 
IBM 360 family of computers represents a large segment of the 
computer field and PL/1 compilers are available, we decided 
to program for the 360-70 series computers. Initially, the 
choice of PL/1 tied us to 360-70 machines but since more than 
50 percent of the computers in the country fall in that category 
this limitation did not pose a serious constraint. Subsequently, 
Burroughs has announced a PL/1 compiler, one is under development 
for Digital Equipment Corporation's PDP-10, and proprietary com-
pilers exist for CDC, Honeywell, and Univac equipment so the
boundaries seem to be relaxing. Although, for instance, FORTRAN 
compilers are available for many makes of computers, transfera- 
bility is not a surety. Only parts of FORTRAN as a whole are 
basic to all the hardware, and thus we would have imposed quite 
severe limitations upon ourselves with that choice. 

CSC programs will run on IBM 360's from a Model 40 on up.
They require a minimum of two tape drives, one or more disks,
and, assuming approximately 3000 search terms (200 profiles 
of 15 terms each), 256K bytes of core storage. 

We believe that our design philosophy has been a service- 
able one. We have demonstrated the utility of PL/1 and use 
of the IBM 360 in that we were able to develop a sophisticated 
information retrieval system and get into a production mode in 
a relatively short period of time. We have been able to test 
many alternative programming approaches and implement changes 
to the system as needed and we have run the system on 15 dif- 
ferent computers. (See Section 2.6). 
2.5 Modularity and Program Modules

Programs were developed in a modular fashion in order to 
permit changes, additions, replacements, and deletions in 
programs and program modules without affecting the entire 
system. A separate block was written for each separate oper-
ation within a program. There are five basic functions carried
out by the programs. The programs together with the names of
the eleven specific program modules that accomplish the func-
tions are:

      Program Function             Program Module Name

(1)   Preparation of data base     DBCOPY  (Data Base Copy)
      input                        FORCON  (Format Conversion)
                                   IFCOPY  (IITRI-Format Copy)

(2)   Preparation of profile       DKEDIT  (Deck Edit)
      input                        MINIPUP (Mini Profile Update
                                           Program)
                                   INPUTR  (Profile Input
                                           Preparation Routine)

(3)   Search data base for         SEARCH
      profiles

(4)   Preparation of search        HITTER  (Hit Recorder)
      output                       DBCARD  (Data Base Card
                                           Format)
                                   DBOCP   (Data Base Output
                                           Control Program)

(5)   Statistics generation        STIXA   (Statistics)

A twelfth program, which is optional, is called PLSXT (Private
Libraries System Extraction) and is used for extracting data 
from the SDI system to be used as input for the Private Libraries 
System (PLS). PLS is a software system for creating and main- 
taining private files or subset data bases. It is discussed 
in Section 5.8. 
The interrelationships of the programs and component
modules can be seen in the simplified flow chart, Figure 2-1.
Details regarding the specific programs and their relation- 
ships to each other and to the files that are used for 
communication between programs and modules are given in 
Sections 5.3 and 5.5. 

Via the modularity feature the total software system 
is constructed of multiple individually replaceable and 
changeable building blocks. Individual modules or programs 
can be changed or replaced without affecting other portions 
of the same program or other programs (and specific sub- 
routines can be called for in certain cases and not others) 
thus permitting a high degree of flexibility. 

An example of this feature can be seen in the fact that
the format conversion module (FORCON) is different for each
data base yet the programs and files it interfaces with are
unaffected. Also the output card formatting module (DBCARD)
is different for each data base depending on which of the
data elements contained on the data base are to be displayed
on the output cards. DBCARD interfaces with other portions
of the system which remain the same regardless of whether the
specific DBCARD program is for Chemical Abstracts (CACARD),
Biological Abstracts (BACARD), or Engineering Index (EICARD).
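
The replaceable-module arrangement can be sketched briefly. The sketch is in Python rather than PL/1, and it is an illustration under stated assumptions, not the CSC implementation. Only the module names CACARD, BACARD, and EICARD come from the text. The function bodies, the citation field names, and the table lookup are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical card formatters: each displays the data elements that one
# data base supplies. The field names are assumptions of this sketch.
def cacard(citation: dict) -> str:
    return f"{citation['title']} / CA section {citation['section']}"

def bacard(citation: dict) -> str:
    return f"{citation['title']} / CROSS code {citation['cross_code']}"

def eicard(citation: dict) -> str:
    return f"{citation['title']} / Card-A-Lert code {citation['card_a_lert']}"

# The rest of the system sees a single interface, whichever module is
# plugged in for the data base being processed.
DBCARD: Dict[str, Callable[[dict], str]] = {
    "CA": cacard,
    "BA": bacard,
    "EI": eicard,
}

def print_cards(data_base: str, citations: List[dict]) -> List[str]:
    """Format every hit citation with the module registered for its data base."""
    formatter = DBCARD[data_base]
    return [formatter(c) for c in citations]


if __name__ == "__main__":
    print(print_cards("CA", [{"title": "Example title", "section": 22}]))
```

Adding a data base under this arrangement means writing one new formatter and registering it; print_cards and its callers are untouched, which mirrors the point made about FORCON and DBCARD.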

In addition to this replacement feature is the ability 
to revise specific programs as needed. For example, if a 
data base supplier adds a new data element to his files or 
changes format, we can change the FORCON program to accommo-
date the supplier change. This can be done readily and easily.
In fact, we have made hundreds of minor changes to individual
program modules and have never interrupted the production
activity of our weekly runs. More significantly, we have
been able to make major changes to programs and conduct
comparative tests quickly and inexpensively. For example,
the basic search strategy has been changed several times and
other approaches have been tested. These tests are discussed
in Section 5.6. 
Figure 2-1

SDI SYSTEM GENERALIZED FLOW CHART 
2.6 Transferability--Machine and Installation Independence

Machine and installation independence permit transfer-
ability of software, which was one of the CSC design goals.
Reasons for this design goal were: anticipation (realized
within a year) of a hardware change at IITRI; the desire to
install our software in organizations that needed an internal
SDI system; and the desire to conduct profile writing work-
shops and training courses both on-site and at other locations.
Successful achievement of this design goal is evident from
the fact that we have installed the system at several indus-
trial organizations and have run the programs at 15 different
computer facilities with no real difficulties. Preparation
of appropriate JCL is usually all that is required. Figure 2-2
indicates the variety of hardware, processors, versions of
the operating system, and releases of the PL/1 compiler that we
have used.



Hardware:                    IBM 360 Models 40, 50, 65, 67, 75
                             IBM 370 Model 155
                             Any computer with PL/1 Compiler

Processors:                  MFT, MVT, PCP, HASP

Operating System Versions:   15-16, 17, 18, 19, 19.6, 20, 21

PL/1 Compiler Releases:      4.1, 5, 5.2


Figure 2-2 

ENVIRONMENTS UNDER WHICH IITRI SOFTWARE HAS RUN 
3. SERVICES

The CSC was designed to provide a variety of services.
Among those currently offered are SDI (Selective Dissemination
of Information), retrospective searches conducted either 
by computer or manually, private library development and 
maintenance, and software installation. SDI is the principal 
service offered by CSC. 

3.1 SDI 

The current awareness or SDI (Selective Dissemination 
of Information) system has been operational since September 
1969, and the Computer Search Center (CSC) is now offering 
services from Chemical Abstracts Condensates, Biological 
Abstracts, Bioresearch Index and Engineering Index's COMPENDEX. 
Searches of other data bases will be added depending on user 
needs. 

The SDI system was designed to include many user-oriented 
features, including: full free form Boolean logic with any
degree of nesting; many searchable elements; all forms of
term truncation; weighting; sort options; and print media
options.

One may include searchable elements as positive or negative
search terms, i.e., one may require the presence or absence 
of any particular search term to qualify a citation as a 
"hit" citation. Among the searchable elements are: 

     Subject terms appearing in titles, text,
       or as index terms
     Author names
     Company names
     Journal names as represented by the
       standard ASTM CODEN
     Country
     CA section numbers
     BA CROSS Codes
     BA BIOSYSTEMATIC Codes
     EI Card-A-Lert Codes



The search terms may be single words, multi-word terms, 
phrases, or portions of words. 
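
Positive and negative search terms combined by nested Boolean logic can be sketched as follows. The sketch is in Python and is illustrative only. The nested-tuple profile notation, the operator names, and the simple substring test are assumptions of this example and do not reproduce the CSC profile format or its logic evaluation module.

```python
def evaluate(expr, text: str) -> bool:
    """Decide whether a citation text satisfies a nested Boolean profile.

    expr is either a search term (a string) or a tuple whose first item
    is "AND", "OR", or "NOT" and whose remaining items are sub-expressions.
    This notation is an assumption of the sketch.
    """
    if isinstance(expr, str):
        # A bare term is satisfied when it occurs anywhere in the text.
        return expr.upper() in text.upper()
    op, *args = expr
    if op == "AND":
        return all(evaluate(a, text) for a in args)
    if op == "OR":
        return any(evaluate(a, text) for a in args)
    if op == "NOT":
        # A negative term: its presence disqualifies the citation.
        return not evaluate(args[0], text)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical profile: (CATALYST or CATALYSIS) and not ENZYME
PROFILE = ("AND", ("OR", "CATALYST", "CATALYSIS"), ("NOT", "ENZYME"))

if __name__ == "__main__":
    for title in ["Platinum catalysts for hydrogenation",
                  "Enzyme catalysis in yeast"]:
        print(title, "->", evaluate(PROFILE, title))
```

Because evaluate calls itself on each sub-expression, the nesting may be as deep as the profile requires, and a negative term is simply a NOT around any searchable element.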

Output may be sorted according to user preference by 
author, weight, or citation number. Standard output is pre- 
pared on 5" x 8" cards. Provisions can be made for printing 
output on paper or multilith masters for further reproduction 
and dissemination within an organization. 

The standard output sent to users is printed on three 
types of cards--header, citation, and trailer. The header 
card as shown indicates: the user profile number, the tape 
service and issue of the tape that was searched, the number 
of citations that were on tape, the number of citations that 
were hit citations for the user's profile, the number of 
citations that were printed, and the date of the search. 
Examples of header cards for CA, BA, and EI are shown in 
Figures 3-1, 3-4, and 3-7. 

There is one 5" x 8" citation card for each hit. A 
citation card includes: citation number; tape source including 
volume and issue number; profile number; authors (as many 
as are given on the source tape) and corporate authors; full 
title; primary source information including journal volume, 
issue, date, pages, and CODEN; index terms; abstracts; codes 
and any other significant information that may have been 
included on the source tape; search terms present, i.e., those 
profile terms that were hit terms for the particular citation; 
and weight for the citation. Examples of citation cards with 
the data items that are specific to a given data base are 
shown for CA, BA, and EI in Figures 3-2, 3-5, and 3-8. Trailer 
cards listing the total citations in a user's output are shown 
in Figures 3-3, 3-6, and 3-9. 

Searches are conducted and output sent to users weekly, 
biweekly, or monthly in accordance with the frequency of the 
particular data base to be searched. 

Figure 3-2 

CA CONDENSATES OUTPUT-CITATION CARD 

Figure 3-3 

CA CONDENSATES OUTPUT-TRAILER CARD 

Figure 3-4 

BA PREVIEWS OUTPUT-HEADER CARD 

Figure 3-7 

EI COMPENDEX OUTPUT-HEADER CARD 

Figure 3-9 

EI COMPENDEX OUTPUT-TRAILER CARD 

3.2 Retrospective Searches 

Retrospective searches, either manual or by machine, 
are provided on request. The price is dependent on the number 
of years searched, the size of the data base (or portion of 
a data base), the number of search terms, and the frequency 
of search terms in the data base. We are currently developing 
programs to provide retrospective searches of indexes, inverted 
files, and/or merged data bases and are planning a service to 
search the forthcoming CAS Integrated Subject File. 

In all cases a judgment is made as to whether a machine 
search or manual search would be most effective and efficient, 
and a recommendation and a cost estimate are then given to 
the requestor. A single term search, for example, can 
certainly be carried out more efficiently by manually searching 
indexes, whereas a search that employs numerous search terms 
and/or complicated logic might best be done by machine. 

3.3 Private Libraries 

Through the Private Libraries System we can create 
tailor-made machine-readable data bases from document 
collections, company report files, and other information 
resources specified by a client. Each such data base, while 
specifically designed in terms of content to reflect the 
particular subject material in the information collection, 
is represented in uniform format on tape. The IITRI data 
base format allows specification of the types of data elements, 
such as author, keyword, or report number within the record 
itself. Different numbers of elements and different elements 
can be specified for individual records and/or data bases. 
The length of each element is also variable and not 
predetermined. The flexibility of this format allows us to generate 
data bases from widely varying types of information. Yet, 
our software works with any data base in this format. We have 
programs that allow addition, deletion, and modification of 
entire records or parts of records in order to update, modify, 
and improve the data at whatever time the client wishes. 
Also, bibliographies, concordances, etc., can be inexpensively 
produced from the data base. A private library that is 
specially tailored for a client is maintained for the client 
and searched exclusively by client organization personnel. 

3.4 Software Installation 

IITRI will install its software at a user's installation, 
providing complete checkout of the software and training of 
operational personnel. The installation service includes: 

• Program source decks and complete documentation, 
including flowcharts and narrative comments 

• Installation and program checkout on-site, 
including JCL and data set preparation 

• Training in running the system, including 
error recovery 

• Detailed training, including test run 
experience, in profile construction and 
refinement 

• IITRI's unique user aids 

• On-site production run test under IITRI 
supervision 

• Maintenance and development support of 
software 

• Consultation service for user problems 

Installation of IITRI's software includes many services 
beyond the handing over of an operational set of programs. 
The software itself, of course, is the sine qua non of the 
installation. We reproduce a full set of source decks, 
then compile them and run a complete test run with the decks 
that will be turned over to the user along with complete 
documentation. These decks are then taken to the user's 
computer facility and checked out by an IITRI specialist. 
At this time JCL for the system is made up, disk and tape 
files assigned, and the software checked out on the user's 
machine. Basic instruction in running the system is given 
to the personnel who will be actively involved 
in the production use of the system, and a test run is performed 
including doctored data designed to cause specific errors-- 
both to demonstrate typical malfunctions and to test the 
operational staff's ability to correct errors and proceed. 
The key phase of the procedure, however, is not the 
software installation, but the profile construction training 
which is performed in a special workshop at IITRI. At 
this point the profile coordinators and users are instructed 
in the techniques necessary to produce effective profiles. 
We supply a complete set of our user aids and detailed 
instruction in their use. Several test runs are made to allow 
the user's staff to get a first-hand knowledge of the 
techniques of profile construction and refinement. IITRI's 
unique combination of experience and capability is available 
to the user throughout the set-up period and thereafter, in 
the interest of providing the user with the ability to produce 
effective profiles. 

The final phase of the installation is an on-site 
production test under IITRI supervision. Two complete pro- 
duction runs are done in one week, with every phase of the 
operation carefully checked and monitored by IITRI personnel. 

At this point the user's staff should demonstrate an ability 
to run the system and recover from errors caused by the 
sorts of faulty input that occur in normal production. 

After the installation is complete, IITRI's fund of 
experience and detailed knowledge of the internal logic of the 
system are available to the user by telephone or mail. 
In addition, program improvements will be provided as they are 
introduced for a period agreed on. Thereafter future 
improvements will be available for a limited charge, allowing the 
user to keep up with the steady improvement in system efficiency 
and effectiveness that results from IITRI's continuing 
investment in refining and optimizing existing programs and 
developing better methods. 

Thus our installation service is a complete package of 
training and operational components. The unique combination 
provides the user with a comprehensive system which he is capable 
of using maximally. 
4. PROFILE PREPARATION AND MODIFICATION 

4.1 Profile Forms 

The search profile is the primary input into the system. 

It is a representation of a question by a user in the terminol- 
ogy of a data base and coded according to the conventions of the 
search system. Search terms are the data elements constituting 
a search profile and are common to the terminology of both the 
search question and the data file. Profile information, user 
identification, and the search question are entered on the Header 
profile form illustrated in Figure 4-1. All search terms relevant 
to a particular search profile are listed on the Terms coding 
form shown in Figure 4-2. Each term is assigned a referent (term 
number) in the sequence by which the terms are listed on the cod- 
ing form. Truncation mode and term type are also entered on the 
Term coding form. 

Terms that are semantically associated can be linked 
together in a single expression. Linked terms are synonyms, 
related terms, or hierarchical (broader, narrower) terms. A link 
designator represents the associated terms and can be used to 
simplify the logic expression and to facilitate the cumulation 
of weights. The link designator, a single character from the 
set A-Z, is entered on the coding form. 

Weights are numerical values assigned to search terms that 
indicate their relative significance to the user. The weights 
augment the logic of the expression and increase the retrieval 
effectiveness of a profile. Term weights can range from 0 to 9. 

If the weight option is chosen, the output can be sorted in 
weighted order, with the highest weighted items printed first. 

A print cutoff can be designated by the user to eliminate print- 
ing of the lowest weighted items. 

Figure 4-1 
PROFILE FORM - HEADER 

FORM P3 

Two modes of weighting are used to circumvent the problem 
arising from the presence of synonyms or related terms in a 
logic expression. A noncumulative mode selects in a link only 
the weight of the highest weighted term that is found in a 
retrieved citation. A cumulative mode adds the weights of all 
other terms. A threshold weight can be specified and only 
citations that satisfy the logic expression and whose weight 
is equal to or greater than the specified threshold are 
retrieved. 
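
The two weighting modes and the threshold test can be sketched in a few lines of Python. This is only one plausible reading of the description: it assumes that the noncumulative mode counts the single highest weight within each link (plus the weights of unlinked terms) and that the cumulative mode simply adds every matched weight. The function names and the (link, weight) data layout are illustrative assumptions, not part of the CSC programs.

```python
# Sketch of link-based weighting under the assumptions stated above.
# Names and data layout are hypothetical, not the CSC implementation.

def citation_weight(matched, cumulative):
    """matched: list of (link, weight) pairs for the profile terms found in
    one citation; link is a letter 'A'-'Z', or None for an unlinked term."""
    if cumulative:
        # Cumulative mode (assumed reading): add every matched weight.
        return sum(weight for _, weight in matched)
    best_per_link = {}
    total = 0
    for link, weight in matched:
        if link is None:
            total += weight  # unlinked terms always count
        else:
            # Noncumulative mode: keep only the highest weight in each link.
            best_per_link[link] = max(best_per_link.get(link, 0), weight)
    return total + sum(best_per_link.values())


def passes_threshold(matched, threshold, cumulative=False):
    """True when the citation weight is equal to or greater than the threshold."""
    return citation_weight(matched, cumulative) >= threshold


matched = [("A", 5), ("A", 3), ("B", 4), (None, 2)]
print(citation_weight(matched, cumulative=False))  # 5 + 4 + 2 = 11
print(citation_weight(matched, cumulative=True))   # 5 + 3 + 4 + 2 = 14
```

In the noncumulative case two synonyms in link A contribute only once, which is the inflation problem the text says the two modes are meant to circumvent.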

The terms and links are associated in a logic expression 
on the Logic coding form illustrated in Figure 4-3. The logi- 
cal operators AND, OR, and NOT can be written in any free form 
Boolean expression with any level of nesting. 

4.2 Profile Options and Features 

The principal features built into the system to achieve 
effective profiles and to allow wide flexibility in the way 
terms can be used are the following: wide variety of term types; 
all forms of term truncation; full free form Boolean logic with 
any degree of nesting to relate terms to each other; grouping 
or linking of similar terms; and weighting of terms according 
to user assignment of relevance. Statistics regarding the use 
of various profile options are given in Section 11. 

4.2.1 Terms 

One may include searchable elements as positive or negative 
search terms, i.e., one may require the presence or absence of 
any particular search term to qualify a citation as a "hit". 

The following are term options available to a user. 

Terms--anything other than a single character 

    Single word 
    Multi-word 
    Phrase 
    Fraction of term 
    Symbol or acronym 

Kinds of Terms--anything on the data base 

    Subject terms appearing in titles, text, or 
    as index terms 
    Author names 
    Company names 
    Journal names as represented by the 
    standard ASTM CODEN 
    Country 
    CA section numbers 
    BA CROSS Codes 
    BA BIOSYSTEMATIC Codes 
    EI Card-A-Lert Codes 

4.2.2 Truncation 

Since many data bases include titles, which are author 
generated and therefore uncontrolled, it is necessary to in- 
clude in one's profile all forms of a desired term to ensure 
retrieval of the desired information. In order to simplify 
this task of specifying all possible relevant word forms and 
fragments, CSC has allowed all options in truncation. Left, 
right, both, and none modes of truncation are permitted. When 
a search term is specified with no truncation, it requires an 
exact match with a term on the data base. Left truncation 
allows substitution of any prefix; right, of any suffix; and 
both, allows all of the preceding plus simultaneous 
substitutions of prefix and suffix on a term or term fraction. (See 
Figure 4-4.) In addition to these four modes there is a fifth 
possibility, infix truncation, wherein substitution is allowed 
on an infix while prefix and/or suffix remain constant; we 
are considering the possibility of adding infix truncation to 
the CSC system. Figure 4-5 shows how it would be used. 
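
The four truncation modes, and the proposed infix mode, can be illustrated with simple string tests. The sketch below is an illustration only: the function names are hypothetical, the numeric mode codes follow the term-card convention given in Section 4.3 (0=none, 1=left, 2=right, 3=both), and matching a single word at a time is a simplifying assumption about how the search is applied.

```python
# Illustrative matching rules for the truncation modes; not the CSC search code.

def term_matches(search_term, mode, word):
    """Test one data base word against a search term under a truncation mode."""
    term, word = search_term.upper(), word.upper()
    if mode == 0:   # none: exact match required
        return word == term
    if mode == 1:   # left: any prefix may precede the term
        return word.endswith(term)
    if mode == 2:   # right: any suffix may follow the term
        return word.startswith(term)
    if mode == 3:   # both: any prefix and/or suffix
        return term in word
    raise ValueError("truncation mode must be 0, 1, 2, or 3")


def infix_matches(prefix, suffix, word):
    """Proposed infix mode: fixed prefix and suffix, variable middle."""
    prefix, suffix, word = prefix.upper(), suffix.upper(), word.upper()
    return (len(word) >= len(prefix) + len(suffix)
            and word.startswith(prefix) and word.endswith(suffix))


print(term_matches("MYCIN", 1, "STREPTOMYCIN"))       # True
print(term_matches("TIN", 3, "BISTRIBUTYLTINOXIDE"))  # True
print(term_matches("TIN", 3, "CONTINUOUS"))           # True, an unwanted hit
print(infix_matches("GLUCOSE-", "-PHOSPHATE", "GLUCOSE-6-PHOSPHATE"))  # True
```

The CONTINUOUS case shows why short fragments need care when both left and right truncation are used.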

Mode     Function                               Example 

none     requires exact match of a term         AZO 

left     allows substitution of any prefix      *AZO 
         on the term                            (e.g., DI + AZO) 

right    allows substitution of any suffix      AZO* 
         on the term                            (e.g., AZO + XY) 

both     allows substitution of any prefix      *AZO* 
         and/or suffix                          (e.g., DI + AZO + METHANE) 

NOTE: * denotes truncation 

Figure 4-4 
TRUNCATION MODES 

Truncation can be used with any kind of data element or 
term type in a given data base. The usefulness of right 
truncation is usually readily understood. Right truncation is used 
to select singular, plural, and other forms of words that 
contain a common stem. In order to regularize the use of commonly 
truncated terms and to assist in the selection of optimal 
truncation forms, we have prepared a Truncation Guide for right 
truncated words. See Section 7 on User Aids for details. The 
usefulness of left truncation is less obvious, but it can be 
readily demonstrated. For example, one might use the left 
truncated term *MYCIN to represent antibiotics and retrieve many 
relevant terms, as can be seen in Figure 4-6. 

The usefulness of the "both" truncation mode can be seen in 
the case where a user interested in organometallic compounds-- 
especially those containing tin--might specify both left and right 
truncation by putting an asterisk on either side of the term tin 
in his profile. Thus, the search term *TIN* would retrieve the 
compounds: tetraphenyltin, triethyltin, and bistributyltinoxide. 

When truncating, one has to be careful not to use term 
fragments or letter groupings that occur frequently in unrelated 
words. In order to avoid inappropriate truncations and identify 
beforehand those candidate search terms that might produce 
irrelevant hits, we have prepared a KLIC (Key-Letter-in-Context) 
Index* for each data base in use at IITRI. See Section 7 for 
details. 

*Note: The KLIC Index was first developed at the University of 
Nottingham in England. 
Infix truncation permits search for any variable 
fragment of a term between prefix and suffix. 

A * B 

Examples of its usefulness in chemical literature: 

electron-*-resonance would retrieve 

    electron-spin-resonance 
    electron-paramagnetic-resonance 

tri*cobaltate(II) would retrieve such compounds as 

    trioxalatocobaltate(II) 
    trichlorocobaltate(II) 
    triiodocobaltate(II) 

glucose-*-phosphate would retrieve both 

    glucose-1-phosphate 
    glucose-6-phosphate 

Figure 4-5 
INFIX TRUNCATION 






Use of the term *MYCIN for antibiotics retrieves 

ACTOMYCIN 

ANTIMYCIN 

BIOMYCIN 

ERYTHROMYCIN 

NEOMYCIN 

STAPHYLOMYCIN 

STREPTOMYCIN 

and many others 

One search term *MYCIN substitutes for 20 to 30 
specific terms. 

Use of simultaneous left and right truncation would 
pick up all of the above terms plus the plural forms. 



Figure 4-6 
LEFT TRUNCATION 



4.2.3 Linking or Grouping of Terms 

In order to simplify the writing of a profile, similar 
or semantically related terms may be linked together in a 
single expression by a link code. That is, several terms that 
are synonymous, related, or hierarchically broader and narrower 
can be represented by a single alphabetic character. This 
simplifies the user's task of writing his logic expression. He 
can merely specify a link designator rather than indicate the 
multiple terms joined by the link in cases where any one of the 
terms would be equally satisfactory in the logic expression. For 
example, a user interested in reactions of halogens and 
alkali metals would use the terms listed below and assign 
the link codes "A" and "B". 

Terms       Link Code          Terms           Link Code 

Halogen         A              Alkali metals       B 
Halide          A              Lithium             B 
Fluorine        A      AND     Sodium              B 
Chlorine        A              Cesium              B 
Bromine         A              Potassium           B 
Iodine          A              Rubidium            B 

In writing his logic expression he would not have to specify 
the terms: 

(Halogen | Halide | Fluorine | Chlorine | Bromine | Iodine) 

and 

(Alkali metals | lithium | sodium | cesium | potassium | rubidium) 

He can merely specify 

(A & B) 
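
The effect of the link codes can be shown with a small Python sketch. It assumes, purely for illustration, that the profile is held as a mapping from each term to its link letter and that a citation is reduced to the set of profile terms found in it; neither the mapping nor the function names come from the CSC programs.

```python
# Hypothetical representation of the halogen / alkali metal example:
# each profile term maps to its link code.
LINKS = {
    "HALOGEN": "A", "HALIDE": "A", "FLUORINE": "A",
    "CHLORINE": "A", "BROMINE": "A", "IODINE": "A",
    "ALKALI METALS": "B", "LITHIUM": "B", "SODIUM": "B",
    "CESIUM": "B", "POTASSIUM": "B", "RUBIDIUM": "B",
}


def links_present(terms_found):
    """Return the link codes satisfied by the terms found in a citation.
    A link is satisfied when any one of its terms is present (terms are OR'd)."""
    return {LINKS[term] for term in terms_found if term in LINKS}


def a_and_b(terms_found):
    """Evaluate the logic expression (A & B) for one citation."""
    present = links_present(terms_found)
    return "A" in present and "B" in present


print(a_and_b({"CHLORINE", "SODIUM"}))   # True: one halogen term, one alkali term
print(a_and_b({"CHLORINE", "BROMINE"}))  # False: link B is not satisfied
```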

4.2.4 Logic 

An effective profile requires not only the use of 
appropriate search terms but also that the terms be related 
to each other in a manner that correctly represents the intent 
of the search question. The relationships are expressed in 
the algebra of logic, called Boolean algebra. Three logic 
operators are used to indicate the relationships between 
search terms: AND, OR, and NOT. The logic symbols used 
are as follows: 

Logic Operators     Symbol 

AND                 & 
OR                  | 
NOT                 ¬ 

AND logic, designated &, will retrieve an item only if 
both terms connected by the AND operator are present. 
The & operator is the familiar conjunction or intersection 
of mathematics and engineering, in which it can be 
represented by × or ∧. 

OR logic, designated |, will retrieve an item if either 
one or both of the terms connected by the OR operator are 
present. The | operator is the familiar inclusive disjunction 
or union of mathematics and engineering, in which it can be 
represented as + or ∨. 

NOT logic, designated ¬, will cause items containing 
a term designated by the NOT operator to be rejected. 
The ¬ operator is also referred to as complement or 
negation and can be represented by an overline. 

Because NOT is a unary operator relating to only one 
term, it is necessary to always precede the NOT operator 
with an AND operator in writing a logic expression. Thus 
the logic expression for a search for a compound having no 
nitrogen and containing oxygen or carbon would be written as: 
(oxygen OR carbon) AND NOT nitrogen. 

Parentheses can be used to limit the effect of the NOT 
term. In the expression 

A & (B | C) & ¬D 

if D is present, the entire expression is false. In the 
modified expression 

A & (B | (C & ¬D)) 

the expression will be true if A and B are present even if 
D is also present. 
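
The difference between the two placements of NOT can be checked directly with Python's own Boolean operators. In this small sketch the function names are arbitrary, and each argument stands for "this term is present in the citation".

```python
def not_applies_to_whole(a, b, c, d):
    # A & (B | C) & ¬D : the presence of D rejects the citation outright.
    return a and (b or c) and not d


def not_limited_to_c(a, b, c, d):
    # A & (B | (C & ¬D)) : D only cancels the contribution of C.
    return a and (b or (c and not d))


# A, B, and D present; C absent.
print(not_applies_to_whole(True, True, False, True))  # False
print(not_limited_to_c(True, True, False, True))      # True
```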

Terms connected by AND or OR are not affected by sequence. 
Thus, 

A & B = B & A 
A | B = B | A. 

Similarly, AND or OR are not affected by grouping. Thus, 

(A & B) & C = A & (B & C) 
(A | B) | C = A | (B | C). 

It should be noted, however, that the placement of 
parentheses in a mixed expression can alter the logic. 

A & (B | C) is not the same as (A & B) | C. 

Several laws of logic may be helpful in determining the 
consequences of writing elementary logic expressions. 

By the law of absorption: 

A & (A | B) = A 
A | (A & B) = A. 

By the law of distribution: 

A & (B | C) = (A & B) | (A & C) 
A | (B & C) = (A | B) & (A | C). 

By the law of duality: 

¬(A & B) = ¬A | ¬B 
¬(A | B) = ¬A & ¬B. 
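
These identities can be confirmed by exhaustive truth tables. The following Python sketch (the dictionary keys and helper name are arbitrary) checks each law for every combination of A, B, and C, and also confirms that regrouping a mixed expression changes its meaning.

```python
from itertools import product

# Each entry compares the two sides of one law for given truth values.
LAWS = {
    "absorption 1":   lambda a, b, c: (a and (a or b)) == a,
    "absorption 2":   lambda a, b, c: (a or (a and b)) == a,
    "distribution 1": lambda a, b, c: (a and (b or c)) == ((a and b) or (a and c)),
    "distribution 2": lambda a, b, c: (a or (b and c)) == ((a or b) and (a or c)),
    "duality 1":      lambda a, b, c: (not (a and b)) == ((not a) or (not b)),
    "duality 2":      lambda a, b, c: (not (a or b)) == ((not a) and (not b)),
}


def holds_everywhere(law):
    """True when the law holds for all eight combinations of A, B, C."""
    return all(law(a, b, c) for a, b, c in product([False, True], repeat=3))


for name, law in LAWS.items():
    print(name, holds_everywhere(law))

# A & (B | C) versus (A & B) | C: these differ, e.g. when A is absent and C present.
mixed_same = holds_everywhere(lambda a, b, c: (a and (b or c)) == ((a and b) or c))
print("mixed grouping equivalent:", mixed_same)
```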

The logic can be written in any free form Boolean 
expression. To avoid logical ambiguity, however, parentheses 
should be used freely. There is no restriction on the 
number of parentheses used; care should be taken to ensure 
that the number of left parentheses equals the number of 
right parentheses. The logic expressions for profiles can be 
as specific and involved as is necessary to express the user's 
question. While most expressions are relatively simple, any 
expression can be handled by the system. For example, the 
following expression would be legitimate: 

(((A&B) | (C | D | E | F)) & ¬G) | ((H&I) & ¬J) 

However, experience indicates that useful retrieval 
can be achieved with a simple logic expression, whereas an 
overly complex expression may obscure a question and result 
in poor retrieval. 
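
To show how a nested expression of this kind can be evaluated against a citation, here is a minimal recursive-descent sketch in Python. It is not the CSC logic evaluator. It assumes the conventional precedence (NOT binds tightest, then AND, then OR), which the text does not state, and it takes as input the set of operands (link letters or 3-digit term numbers) found in a citation. It also reports unbalanced parentheses.

```python
import re

# Operands are 3-digit term numbers or single link letters.
TOKEN = re.compile(r"\s*(\d{3}|[A-Z]|[&|¬()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(expr, present):
    """Evaluate a logic expression; `present` is the set of operands found."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()      # parse first so no tokens are skipped
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        tok = take()
        if tok == "¬":
            return not parse_not()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("unbalanced parentheses")
            return value
        if tok is None or tok in ("&", "|", ")"):
            raise ValueError("operand expected")
        return tok in present

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses or trailing input")
    return result


EXPR = "(((A&B) | (C | D | E | F)) & ¬G) | ((H&I) & ¬J)"
print(evaluate(EXPR, {"A", "B"}))       # True
print(evaluate(EXPR, {"A", "B", "G"}))  # False
```

This sketch accepts NOT anywhere an operand may appear, which is more permissive than the rule that NOT must be preceded by AND.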

4.2.5 Weights 

CSC profiles permit the assignment of weights by users 
to further refine their profiles. Weights are numerical 
values from 0 to 9 assigned to terms to specify their 
relative importance. If a user employs weights in his profile, 
the output is arranged in descending weighted order so that 
those citations with the highest weights--presumably the 
references that are of most significance to the user--will 
be on top. Since the output of a search is limited to the 
specified maximum number of hits, the printed output then 
contains the highest weighted citations up to the cut-off 
number of hits. If the user chooses not to use weights (and this 
is usually the case), the output is ordered either numerically 
by citation number or alphabetically by author (first letter 
of the first author's last name). 
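
The ordering and cut-off behavior can be sketched as follows. The sort codes WT and AN are the ones listed for the header card in Section 4.3; treating any other code as author order, the dictionary layout of a hit, and the function name are assumptions made only for this illustration.

```python
def order_output(hits, sort_type, max_cards):
    """hits: list of dicts with 'citation', 'author', and 'weight' keys.
    Returns at most max_cards hits in the requested order."""
    if sort_type == "WT":
        # Descending weight: most significant citations first.
        ordered = sorted(hits, key=lambda h: h["weight"], reverse=True)
    elif sort_type == "AN":
        # Ascending citation number.
        ordered = sorted(hits, key=lambda h: h["citation"])
    else:
        # Author order, keyed on the first letter of the first author's name.
        ordered = sorted(hits, key=lambda h: h["author"][:1].upper())
    return ordered[:max_cards]


# Made-up hits for demonstration only.
hits = [
    {"citation": 81368, "author": "Smith", "weight": 3},
    {"citation": 81012, "author": "Jones", "weight": 9},
    {"citation": 81500, "author": "Adams", "weight": 5},
]
print([h["citation"] for h in order_output(hits, "WT", 2)])  # [81012, 81500]
```

With a cut-off of two cards, the weight sort keeps the two highest weighted citations and drops the lowest.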

Although the designed purpose of weights was to allow 
further specification in a given profile, CSC has found that 
users employ weights in order to separate two or three 
profiles that are submitted as one profile for one subscription 
fee. 

4.3 Profile Format 

After a profile has been written and checked by the 
profile coordinator it is keypunched. The keypunched profile 
consists of a header card, a group of term cards, and one 
or more cards containing a logic expression. These cards 
have the following internal structure: 

Card      Columns    Contents 

Header    1-10       Profile number 
          11-13      Number of terms 
          14-16      Number of links (a link is a group of 
                     disjoint terms) 
          17-19      Maximum number of cards to be printed 
          20-22      Minimum number of terms necessary to 
                     satisfy the logic expression 
          23-25      Private Libraries usage (contains 'PRI' 
                     if output is to be placed in a Private 
                     Library) 
          26         Output medium (C=cards, P=paper) 
          27         Number of copies (#=1, else 1-9 
                     permissible) 
          28-29      Sort type for output (AN=ascending 
                     citation number order, WT=descending 
                     weight order, 03=author) 

Term      1-10       Profile number 
          11-13      Term number 
          14         Truncation mode (0=none, 1=left, 
                     2=right, 3=both) 
          15-16      Type of field to be tested 
          17         Link (terms with the same letter in this 
                     place are OR'd together. The link letter 
                     may then be used as an operand in the 
                     logic expression) 
          18         Weight of term (if weights are used) 
          19-38      The term 
          49         On last term card this position is '1' 

Logic     1-10       Profile number 
          11-12      Minimum number of terms that must be 
                     found to satisfy the logic 
          13-59, 60  The logic expression, consisting of terms 
                     (3-digit term numbers), links (single 
                     characters), Boolean operators ("&", "|", 
                     "¬"), and parentheses. If the expression 
                     takes more than one card, all but the 
                     last card have a '1' in position 60. The 
                     last character of the logic expression 
                     is $. 
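
A fixed-column layout of this kind is straightforward to decode. The Python sketch below reads one term card using the column positions listed above (1-10, 11-13, 14, 15-16, 17, 18, 19-38, and 49). The sample card, its profile number, and its field type code "01" are invented for illustration, and the function names are hypothetical.

```python
def field(card, first, last):
    """Return card columns first..last (1-based, inclusive), stripped."""
    return card[first - 1:last].strip()


def parse_term_card(card):
    """Decode one term card according to the column layout in the text."""
    card = card.ljust(80)  # pad short cards to full card width
    return {
        "profile": field(card, 1, 10),
        "term_number": int(field(card, 11, 13)),
        "truncation": int(field(card, 14, 14)),   # 0=none, 1=left, 2=right, 3=both
        "field_type": field(card, 15, 16),
        "link": field(card, 17, 17) or None,      # blank means no link
        "weight": int(field(card, 18, 18) or 0),  # blank means weights unused
        "term": field(card, 19, 38),
        "last_term": field(card, 49, 49) == "1",
    }


# Invented sample card: term 001, right truncation, link A, weight 5.
sample = ("P000000123" + "001" + "2" + "01" + "A" + "5"
          + "CHLOR".ljust(20) + " " * 10 + "1")
parsed = parse_term_card(sample)
print(parsed["term"], parsed["truncation"], parsed["link"], parsed["last_term"])
```

The header and logic cards could be decoded the same way by changing the column ranges.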

4.4 Profile Modification 

The problem of preparation and modification of search 
profiles has undergone careful investigation at the Center 
in light of the relevant statistics and the summary of 
experience obtained from the pilot group of users. The 
best profile is prepared when the person writing the profile 
understands three things: the intent and terminology of the 

search problem, the contents of the data file, and the 
characteristics of the search system. Ideally, the user 
who has the best understanding of his problem should become 
familiar with the contents of the data file and the search 
system so that he can write an effective profile. In lieu 
of that, if he is unwilling or unable to do so, the res- 
ponsibility is assumed by a middleman either at the 
user's institution or at the Center. 

At CSC we have handled profiles prepared in all three 
ways. As would be expected, in the cases where the user took 
sufficient interest to learn the system and write and modify 
his own profile, the result was a good profile and a satis- 
fied user. Good results were also obtained when the user 
took sufficient time and interest to fully explain his search 
problem to a company or Center profile coordinator. 

In preparing a profile for an SDI run, care must be 
taken to include not only the terms that describe the user's 
interests but also all synonyms for those terms used in the 
vocabulary of the particular data base to be searched. 

Omission of similar terms may result in a loss of pertinent 
articles. A logic expression combining those terms must also be 
developed that will not be too general or too restrictive. 

Since the preparation of a profile for an initial run may 
not completely describe the user's interest, it is usually 
necessary to modify the profile three times to correct 
omissions of terms and faulty logic. 

The output produced for the first few runs of a new 
profile can be reviewed to help modify the profile. Several 
questions must be considered in making revisions in the 
profile. Are all pertinent articles retrieved by the SDI 
run? This can be answered by a comparison with a manual 
search of the material covered by the SDI run. If there are 
missing articles, the omitted citations must be studied for 
additional terms and logic to be added to the profile. The 
terms may be present in the profile, but the logic may be 
restrictive. In this case, the logic must be relaxed, but 
at the same time, not overly generalized. Is the SDI run 
producing a great deal of nonpertinent material? This may 
be due to inclusion of terms that are too general, for example, 
ENZYME may be used when the names of specific enzymes would 
bring about more relevant retrieval. The logic expression 
may also be too general and need modifying terms to make it 
more restrictive. These modifying terms may be positive 
hit terms tied to the logic by the AND logic operator, or 
they may need to be of the negative type, which cause a 
citation to be rejected if the negative words appear in it. 

It is possible that some questions submitted to an SDI 
system are of such a nature that much nonpertinent material 
must be retrieved in order to gather the citations that are 
of definite interest. In contrast, it is also possible that 
a subject may be so new or esoteric that little has been 
published. This type of question may legitimately produce 
very small quantities of output with very few articles of 
real interest. 

Based on our experience between September 1969 and June 
1972 and our observation of user preparation and modification 
of profiles, we have come to the conclusion that although 
users can be trained to write their own profiles, the user 
who conscientiously revises and updates his own profile under 
his own impetus is the exception rather than the rule. 

CSC experience indicates that since it requires almost as 
much time to check a user-written profile as to write it, 
it would be more advantageous for CSC to write the original profiles. 
CSC would then be in a better position to revise profiles 
for the users. CSC profile coordinators are closer to the 
data bases, can recognize data base content changes more 
rapidly than the users can, and hence can respond by changing 
profiles accordingly. 

Several user aids have been prepared by the Computer 
Search Center to assist the staff and users in developing, 
evaluating, and modifying search profiles. These are 
described in Section 7 — User Aids. 
5. SOFTWARE SYSTEM 

The CSC software system was designed to accommodate a 
variety of types of users with a variety of types of data 
bases that would meet their needs. Search programs for 
handling machine-readable data bases are expensive to de- 
velop and expensive to maintain. Since we had no desire to 
incur the expense of maintaining multiple search programs, 
we developed a general purpose search program that would 
handle virtually any of the machine-readable data bases 
containing natural language information. 

When handling multiple data bases, one is very likely 
to encounter multiple character sets and multiple character 
codes. The tape formats and record formats differ from data 
base to data base. In fact, they differ within data bases 
that are produced by the same organization. The data elements 
contained on the tapes vary considerably from tape to tape. 

This format variation problem that occurs when handling 
multiple data bases led to the adoption of the standard IITRI 
file structure and preprocessor system described in Section 5.1. 

The general purpose search system carries out the five 
basic functions of preparing profile input, preparing data 
base input, searching the data base for information correspond- 
ing to the user profiles, preparing output for dissemination 
to the users, and maintaining statistics. These are shown in 
a generalized flow chart, Figure 5-1. 

The five basic programs consist of eleven program modules. 
Descriptions of the main programs, constituent program modules, 
and the files by which they communicate with each other are 
presented in Sections 5.3 and 5.5. Flow charts showing the 
interrelationships and interfaces between and among programs 
and files are presented in Figures 5-2, 5-3, 5-4, 5-5, 5-6, and 
5-7. The development of the CSC search strategy is discussed 
in Section 5.6 and logic systems--current and projected--are 
presented in Section 5.7. 








Figure 5-1 

SDI SYSTEM GENERALIZED FLOW CHART 
Figure 5-2 

SDI SYSTEM DETAILED FLOW CHART 
PART 1: DATA BASE INPUT 

Figure 5-3 

SDI SYSTEM DETAILED FLOW CHART 
PART 2: PROFILE INPUT 

Figure 5-4 

SDI SYSTEM DETAILED FLOW CHART 
PART 3: SEARCH 

Figure 5-5 

SDI SYSTEM DETAILED FLOW CHART 
PART 4: OUTPUT 

Figure 5-6 

SDI SYSTEM DETAILED FLOW CHART 
PART 5: STATISTICS 

Figure 5-7 

SDI SYSTEM DETAILED DATA FLOW 
PART 6 (Optional): PRIVATE LIBRARIES SYSTEM EXTRACTION 
A Private Libraries System (PLS) that is not a basic 
part of the CSC software system is discussed in Section 5.8. 

It is a generalized system for creation and maintenance of user- 
defined private files that may contain virtually any document 
records the user wants to retain. PLS interfaces with the CSC 
system and can automatically accept as input specified output 
from the CSC system. This is another example of the benefits of 
modular programming. 

5.1 Data Base File Structure and Preprocessor System

The requirement for a single generalized programming system 
for processing multiple data bases necessitated the design of a 
file structure that would accommodate all of the variables one 
might encounter in different data bases, such as multiple char- 
acter sets and character codes, differing tape formats and 
record formats, different data elements and different ways of 
representing the same data element. In the IITRI system a different
data type code is assigned to each kind of data element
found on a data base. The data elements found in the data bases 
we are now using are shown in Figure 5-8. 

Each data base that is to be searched is reformatted by a 
preprocessor program that converts the tape into our standard 
file structure. (See Figure 5-9.) After reformatting, each 
record is composed of a key, directory, and character string. 

The key contains the volume, issue, and citation number as given 
by the data base supplier, and the directory identifies each 
type of element contained in the record according to IITRI data 
type codes. The string contains the data. 

In the directory the data type code is followed by the start- 
ing position for the actual data and an indication of the number 
of characters required by the data. Thus, in Figure 5-10 for 
the record having Citation Number 81368 of Volume 74, Issue 
16, in Chemical Abstracts Condensates there is a CODEN that 
starts in position 1 and is 26 characters long. The
next kind of data element included in the record is a Journal
Name which has a data type code "04". The actual data starts in 
position 27, one position beyond the end of the CODEN data, and 
is 14 characters long. The next data element is the 
title which has data type code "02" and starts in position 41, 
one position beyond the end of the journal data. The title 
data is 76 characters long, and the rest of the data are recorded
in a similar fashion. Following on through Figure 5-10,
the format becomes obvious. The string portion of Figure 5-10
shows how the actual data for this particular reference are
contained in IITRI format on tape, and the complete record, which
appears in the lower portion of Figure 5-10, shows the entire
key, directory, and character string for the particular record
as it appears on tape.
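
The directory arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not the FORCON code: the function names `build_record` and `get_field` are hypothetical, and the field texts are transcribed from Figure 5-10 on the assumption that no padding separates adjacent fields.

```python
from typing import List, Optional, Tuple

DirectoryEntry = Tuple[int, int, int]  # (data type code, start position, length)


def build_record(fields: List[Tuple[int, str]]) -> Tuple[List[DirectoryEntry], str]:
    """Assemble a directory and character string from (code, text) pairs.

    Positions are 1-based, as in Figure 5-10; each field starts one
    position beyond the end of the previous field.
    """
    directory: List[DirectoryEntry] = []
    parts: List[str] = []
    position = 1
    for code, text in fields:
        directory.append((code, position, len(text)))
        parts.append(text)
        position += len(text)
    return directory, "".join(parts)


def get_field(directory: List[DirectoryEntry], string: str, code: int) -> Optional[str]:
    """Return the text of the first field with the given data type code."""
    for entry_code, start, length in directory:
        if entry_code == code:
            return string[start - 1:start - 1 + length]
    return None


# First three fields of the Figure 5-10 citation (CODEN, journal, title).
fields = [
    (1, "JPCHAX/75/3/325-30/000071/"),
    (4, "J. PHYS. CHEM."),
    (2, "VIBRONIC EFFECTS IN THE INFRARED SPECTRUM OF THE ANION OF TETRACYANOETHYLENE"),
]
directory, string = build_record(fields)
for entry in directory:
    print(entry)
```

Because every field is located through the directory, a program that wants only the title never has to scan the preceding fields.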

The use of data element codes allows us to handle mul- 
tiple, varied data elements. The system also allows us to add 
new data elements and new data type codes as they arise. We 
have no way of knowing what new data elements suppliers may 
include in their tapes a few years from now. However, we 
have allowed for 2 -1 different data type codes. It is unlikely
that we will be unable to accommodate any new data
element that may come into existence.

The standard IITRI format is employed for any data base 
processed. Our method for handling multiple data bases is to 
write a preprocessor program for each different data base that 
is handled in the system. The preprocessor program reformats 
the data that is contained on the supplier data base and puts 
it into IITRI format. In that way every data base looks the 
same to the search program, and all data bases can be handled 
by one and the same search program. 

The preprocessor, or format conversion, programs are referred
to in the CSC system as FORCON programs. Details regarding
the development of the format conversion programs for
a variety of data bases are given in the following section.
Data Element                      IITRI Data Type Code

Source information                01
    CODEN
    Journal reference
    Pagination
    Dates
Title of article                  02
Author(s)                         03
Short journal title               04
Keyword(s)                        05
    Index terms
    CA section number
CA Registry number                06
Molecular formula                 07
Corporate author                  08
Abstract text                     09
BA CROSS code                     10
BA biosystematic index            11
EI Card-A-Lert Code               12
Publication information           13
    Original language
    Availability
    Publisher
    Price
    Parent journal
    Original abstract source
CA cross reference                14
Patent priority class             15
Secondary source                  16

Figure 5-8

DATA ELEMENTS AND IITRI DATA TYPE CODES
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Figure 5-9 
PREPROCESSOR SYSTEM 
Key:

7416-081368    (Volume, Issue and Abstract Number)

Directory:

Data Type    Start    Length
    1           1       26     (CODEN)
    4          27       14     (Journal)
    2          41       76     (Title)
    3         117       60     (Author(s))
    8         177       51     (Corp. Author)
    5         228       40     (Index Terms)
   13         268       17     (Language)

String:

JPCHAX/75/3/325-30/000071/J. PHYS. CHEM. VIBRONIC EFFECTS IN
THE INFRARED SPECTRUM OF THE ANION OF TETRACYANOETHYLENEDEVLIN,
J. PAUL$MOORE, JESSE C.$SMITH, DONALD$YOUHNE, YOUNG$DEP. CHEM.,
OKLAHOMA STATE UNIV., STILLWATER, OKLA.$CA073000$IR SPECTRA
ALKALI METAL SALTSORIG. LANG.: ENG

Complete Record Appears on Tape as:

7416-081368 1 1 26 4 27 14 2 41 76
3 117 60 8 177 51 5 228 40 13 268
17 JPCHAX/75/3/325-30/000071/J. PHYS. CHEM. VIBRONIC EFFECTS
IN THE INFRARED SPECTRUM OF THE ANION OF TETRACYANOETHYLENEDEVLIN,
J. PAUL$MOORE, JESSE C.$SMITH, DONALD$YOUHNE, YOUNG$DEP. CHEM.,
OKLAHOMA STATE UNIV., STILLWATER, OKLA.$CA073000$IR SPECTRA
ALKALI METAL SALTSORIG. LANG.: ENG

Figure 5-10

IITRI FORMATTED CITATION
5.2 Format Conversion Programs

5.2.1 Variability of Data Base Format

In the course of developing the CSC system we have examined
numerous data bases, both to determine the feasibility
and cost of converting them to our format for searching and
to determine whether sufficient user interest exists to
warrant marketing them. Among those we have studied are:

    Biological Abstracts Previews (BAP)
    Chemical Abstracts Service (CAS) data bases:
        Condensates
        Integrated Subject File (ISF)
        Chemical Industry Notes (CIN)
        Chemical Titles (CT)
        Chemical-Biological Activities (CBAC)
        Polymer Science and Technology (POST)
    Engineering Index (EI) COMPENDEX
    Educational Resources Information Center (ERIC)
    Food Science and Technology Abstracts (FSTA)
    Government Reports Announcements (GRA)
    Institute for Scientific Information (ISI)
    Institution of Electrical Engineers (England) (INSPEC)
    Medical Literature Analysis and Retrieval System (MEDLARS)
    Metals Abstracts Index (METADEX)
    Searchable Physics Information Notices (SPIN)

Further information on these and other machine-readable data
bases is contained in the Association of Scientific Information
Dissemination Centers (ASIDIC) Survey of Information Center
Services.

Despite several proposed standards for tape data bases,
including those of the Committee on Scientific and Technical
Information (COSATI), the American National Standards Institute
(ANSI), and the International Standards Organization (ISO),
no one format is in general use. Several of the publicly
available data bases are "based on" standards, but none can
claim exact adherence. Many data bases have adopted the
directory-plus-string organization, but organization and
contents of the directory, data tag values, character
codes, and control information vary widely. Since most
standards do not include data element tag values (the
codes which specify the contents of a given field), even
those data bases designed around the same standard may
use widely different codes. Some suppliers have designed
hierarchies of codes (e.g., in the INSPEC data base, the
type 3xx data elements are identification codes such as
310 for CODEN, 320 for ISBN, etc.) while others assign
codes arbitrarily (e.g., CAS uses sequentially
assigned numbers to handle new data types). Since the
standards include a header that describes the format of
the directory, not only the code values but the code formats
can differ. One supplier might use a three-digit
numeric code and another a five-digit code.

Some suppliers, however, have not adopted the directory-plus-string
organization. The ISI data base uses fixed-format records,
with the attendant complications necessary to allow varying
length data. The CAS data bases, which share a format
among themselves, use a modified directory-plus-string
organization, but also allow short items to be stored in the
directory itself. In addition, even data bases within the
CAS Standard Distribution Format (SDF) have significant
variations. Most CAS data bases use the same data element,
the Temporary Abstract Number, to associate the physical
records describing a single citation into a single logical
record. The CAS-CIN data base, however, does not give
Temporary Abstract Numbers at all, but uses a different
data element to make the necessary association.
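
The consequence of differing tag values is that each conversion program needs its own table from supplier tags to IITRI data type codes. The sketch below illustrates the idea under stated assumptions: only the INSPEC tag 310 (CODEN) comes from the text above, the "SUPPLIER_X" entries and all names (`TAG_MAPS`, `translate_tags`) are hypothetical, and the real FORCON programs are not written this way.

```python
from typing import Dict, List, Tuple

# Supplier tag -> IITRI data type code (Figure 5-8).
# INSPEC "310" = CODEN is taken from the text; the SUPPLIER_X entries
# are invented to show a five-digit tag format.
TAG_MAPS: Dict[str, Dict[str, int]] = {
    "INSPEC": {"310": 1},
    "SUPPLIER_X": {"00017": 1, "00020": 2},
}


def translate_tags(
    supplier: str, tagged_fields: List[Tuple[str, str]]
) -> Tuple[List[Tuple[int, str]], List[str]]:
    """Convert supplier-tagged fields to (IITRI code, text) pairs.

    Returns the converted fields and the list of supplier tags that
    have no IITRI equivalent in the table, so they can be reviewed.
    """
    tag_map = TAG_MAPS[supplier]
    converted: List[Tuple[int, str]] = []
    unmapped: List[str] = []
    for tag, text in tagged_fields:
        if tag in tag_map:
            converted.append((tag_map[tag], text))
        else:
            unmapped.append(tag)
    return converted, unmapped


converted, unmapped = translate_tags(
    "INSPEC", [("310", "JPCHAX"), ("320", "ISBN-PLACEHOLDER")]
)
print(converted, unmapped)
```

Keeping the table separate from the processing flow is one way to reflect the observation that tag values and formats change between data bases while the basic conversion logic does not.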

5.2.2 Data Base Documentation 

In view of the wide diversity in data base formats it is
particularly unfortunate that documentation is often poor.
Although some suppliers, such as CAS and INSPEC, provide
complete and detailed information along with examples and printouts,
other sample tapes have been received with documentation
as crude as a six-page Xeroxed description. Often, too, it is
the data base which is poorly designed or overly complex that
comes with the least satisfactory documentation.

5.2.3 Programming 

In order to search a given data base, we first write a pro- 
gram to convert the supplier's tapes into our format. IITRI 
format is a directory-plus-character-string organization, using 
pure binary values in the directory and an EBCDIC-coded character 
string. While this mixed-mode arrangement is undesirable for 
distribution of a data base, it allows much faster access to 
data during processing. Since our format is used only for our 
internal purposes and not for distribution, we can justify this 
somewhat inelegant usage. If we were to distribute search output
to users or other centers in magnetic tape form (currently
prohibited by supplier license restrictions), we would convert
all binary numbers to EBCDIC prior to distribution. CAS uses a 
similar mixing of binary tags and ASCII data on their distribu- 
tion tapes and this mixture of storage modes makes hardware 
translation of ASCII to EBCDIC impossible. We then must expend 
a significant part of our conversion time for that data base on 
software translation. 
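
The mixed-mode arrangement can be illustrated with a short Python sketch. The layout used here is an assumption made for the example, not the documented IITRI tape layout: a two-byte entry count followed by three big-endian halfwords per directory entry, then the character string encoded with Python's `cp037` codec (one common EBCDIC code page). The function names are hypothetical.

```python
import struct
from typing import List, Tuple

DirectoryEntry = Tuple[int, int, int]  # (data type code, start, length)


def pack_record(directory: List[DirectoryEntry], string: str) -> bytes:
    """Pack a binary directory followed by an EBCDIC character string.

    Assumed layout: entry count (halfword), then code/start/length
    halfwords per entry, all big-endian, then the EBCDIC text.
    """
    header = struct.pack(">H", len(directory))
    entries = b"".join(struct.pack(">HHH", *entry) for entry in directory)
    return header + entries + string.encode("cp037")


def unpack_record(data: bytes) -> Tuple[List[DirectoryEntry], str]:
    """Reverse pack_record: binary directory first, then the string."""
    (count,) = struct.unpack_from(">H", data, 0)
    directory = [struct.unpack_from(">HHH", data, 2 + 6 * i) for i in range(count)]
    string = data[2 + 6 * count:].decode("cp037")
    return directory, string


directory = [(1, 1, 6), (4, 7, 14)]
string = "JPCHAX" + "J. PHYS. CHEM."
packed = pack_record(directory, string)
print(len(packed), unpack_record(packed) == (directory, string))
```

The sketch also shows why a whole-record character translation cannot be applied to such a tape: the directory bytes are numbers, not characters, so only the string portion may be passed through a code conversion.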

The conversion from supplier format to IITRI format is done 
by a separate program for each data base. So far no two data 
bases have been found to be exactly compatible. Generally, how- 
ever, the process of adding a new data base to our capability is 
simple. Most directory-plus-string data bases are similar enough 
that a new format conversion (FORCON) program can be based on an 
existing one. The changes necessary to convert a FORCON for 
USGRDR into one for INSPEC, for example, are relatively minor, 
since they are based on very similar standards. Data element 
tags and storage formats change, but the basic processing flow 
is unaltered. Also data bases from a single supplier may be very 
similar. The various CAS data bases in SDF can be handled by 
very modest changes in the conversion program. In the case of
CAS, the SDF data types are the same for all the data bases,
except for a few types unique to single data bases, and storage
formats are identical.

The task of writing a format conversion program has two 
parts. The first, understanding the data base, is always the
more difficult. The actual writing of the program is almost 
trivial once we are thoroughly familiar with the data base. 

There are four stages of development in acquiring a new data 
base capability: 

Stage 1 Evaluate the contents and format to 
determine complexity of conversion 
and usefulness of data. 

Stage 2 Implement a rough conversion program 
to allow test search and production 
of samples. 

Stage 3 Improve the Stage 2 FORCON for detailed
testing, to allow rough timing estimates
and extended-period tests.

Stage 4 Implement a production FORCON, smooth- 
ing out logic and aiming at improved 
execution speed. 

In many cases the results of Stage 1 or Stage 2 indicate 
that no further development is desirable at present. At this 
point we have the knowledge necessary to produce a FORCON or do 
basic tests if user interest develops, but no further work would 
be profitable either because user interest is negligible or im- 
plementation problems are unworkably large. 

If a data base seems to have potential for CSC and Stage 1
and Stage 2 experience indicates a good data base and a satisfactory
supplier, then a Stage 3 FORCON is a good investment and
an extended trial is carried out. The EI COMPENDEX tapes, for
instance, were tested for a year before we made a firm commitment
to maintain subscriptions. Sometimes Stage 2 can be skipped.
New CAS data bases, for instance, can be handled with such minor
changes to existing FORCONs that virtually no preliminary testing
need be done. The evaluation stage is also drastically reduced
in such cases.

A production FORCON is based on significant experience
with the data base and incorporates changes and
improvements designed to increase operating speed and consistency
of output. At this point variations from the documentation,
which virtually always exist, can be corrected. Also at this
point, special output programs and card formats can be fixed,
while earlier tests are done with standard or slightly
modified ones.

5.2.4 Status 

Currently we have production-level FORCONs for CA Condensates
(SDF), BA Previews, BioResearch Index, and EI COMPENDEX. These
are well-tested programs and their logic flow and object code have
been carefully analyzed for efficient operation. Test-level
FORCONs have been written for CBAC (pre-SDF), POST (pre-SDF), CT
(pre-SDF and SDF), CIN (SDF), ISI, FSTA, and INSPEC. These programs
have been tested and output has been checked for consistency
and correctness. We are currently evaluating CIN for CAS. We
plan to offer FSTA beginning in the fall of 1972. INSPEC and ISI
are being evaluated for marketability. Evaluation-level FORCONs
have been written for USGRDR, American Mathematical Society, ERIC,
and many other data bases. These are being reviewed for suitability
of contents and difficulty of conversion. Completeness of
data is also checked (lack of CODEN, corporate author, or other
data is a drawback).

The development of FORCONs and evaluation of data bases is 
a continuing part of CSC's development program. The resulting
awareness of features of various data bases is useful in evalua- 
tion of our own system and in counseling our subscribers as well 
as in planning for future expansion to other data bases. In 
addition we can suggest desirable features from data bases we 
evaluate to suppliers of our production data bases. In some cases 
the suppliers are able to add features or revise procedures on 
the basis of our suggestions. 
5.3 Program Descriptions



The SDI system consists basically of a group of pro- 
grams for handling data bases. The programs communicate with 
each other by means of data files (see Section 5.5). There 
are five basic programs for carrying out the five basic functions
of data base input preparation, profile input preparation,
search, output preparation, and statistics generation.
These programs consist of a number of modular programs and 
subroutines. The constituent programs are described below 
together with an indication of the files they use. 

5.3.1 DBCOPY 

DBCOPY (Data Base COPY program) copies the data base
tape in the supplier's format to another tape for archival
storage. Six CA tapes are stored on each archive tape.

File Name    Use in DBCOPY

DBNAME       The data base tape. Scratched after copying.
UFDBvvii     The archival copy (vv = vol, ii = issue).
             Kept permanently.



5.3.2 FORCON 

FORCON (FORmat CONversion program) reads the data base
tape and converts it from the supplier's format to IITRI for- 
mat. All records dealing with a citation are read in turn 
and the IITRI-format directory and a preliminary string are 
assembled. The final string is formed by extracting portions 
of the preliminary string, performing any necessary transla- 
tions, and concatenating them. The translation and concaten- 
ation are done in BAL subroutines to avoid inefficient PL/1 
object code. 
File Name    Use in FORCON

UFDBvvii     The data base tape (vv = vol, ii = issue).
             Kept permanently.
IFDBvvii     The IITRI-format tape (vv = vol, ii = issue).
             Kept 1 year.
DBOCPPRT     The FORCON (and, later, OCP) listing. Destined
             for microfilm conversion.



5.3.3 IFCOPY 

IFCOPY (IITRI Format COPYing program) takes an IITRI-formatted
tape and copies it. It is used to produce a single
file from each volume of a data base. This file can then be
used for retrospective searches. Approximately 85,000 IITRI-format
citations (without abstracts) fit on one tape reel.

File Name    Use in IFCOPY

IFDBvvii     The input IITRI-format tape file (vv = vol,
             ii = issue). Kept 1 year.
IFRFDBvv     The IITRI-format retrospective file
             (vv = volume); new records are appended.
             Kept permanently.

5.3.4 DKEDIT

DKEDIT (Profile DecK EDITor) scans search profiles for
errors. Cards are checked for internal errors and the profile
as a whole is checked to verify the information in the header
card. The logic statement is checked for consistency with the
terms and links read. Search terms that would match any of
the 50 most frequent terms in CA are flagged.

File Name    Use in DKEDIT

Cards        The keypunched profile cards to be checked
             (see the Data Set Description for PCSCCARD
             for the structure of this file).


5.3.5 MINIPUP 



MINIPUP (MINI-Profile Update Program) merges a profile
update deck into another profile deck. In practice it is
used to merge changes into a permanent profile stream stored
on tape. A special card is used to drop profiles without
replacement. An intermediate data set used in the process is
stored on tape and can be used to re-create the output tape
if it should be lost through machine malfunction during processing.

File Name    Use in MINIPUP

Cards        The keypunched cards containing profiles to be
             inserted (see the Data Set Description for
             PCSCCARD for the structure of this file).
PCSCCARD     The existing profile stream, into which the
             updates are inserted. This file is used both
             for input and output.
PCSCBKUP     An intermediate work file which contains all
             information necessary to create the output
             version of PCSCCARD.



5.3.6 INPUTR 

INPUTR (Profile INPUT and Reformatting program) reads
the profile stream and builds the data structures used to
describe the profiles in SEARCH. Terms are aggregated and
divided into groups by Least Common Bigram (LCB). Logic expressions
are expanded, by replacing links with the disjunction of their
component terms, and converted to Early Operator Reverse Polish
form. Profile information blocks are constructed from the
header cards and other information.

File Name    Use in INPUTR

PCSCCARD     The profile stream to be converted.
PCSCTEXT     The index to the aggregated term list.
DBLCB        LCB's for the specific data base.
PCSCTERM     Unique search terms, sorted on LCB.
PCSCHEAD     Profile information blocks.
PCSCLOGC     Logic expressions for all profiles in the run.
PCSCPASS     The run communications file, used to pass data
             to SEARCH, STIXA, etc.

5.3.7 SEARCH

SEARCH reads the profile description structures created
by INPUTR and uses them to search an IITRI-format tape. The
search proceeds by reading one citation at a time and determining
for which (if any) profiles the citation is a hit. If
the citation is a hit, one copy of the citation is written
to the hit file for each profile for which it is a hit. After
all citations have been read a file is written containing the
information necessary to build the Search Term Frequency/Issue
listing which accompanies the citation cards.

File Name    Use in SEARCH

IFDBvvii     The IITRI-formatted tape to be searched
             (vv = vol, ii = issue). Kept 1 year.
PCSCTEXT     The index to the aggregated term list.
PCSCTERM     The aggregated term list: all terms contained
             in all profiles in the run with duplicate terms
             removed, in order on LCB.
PCSCLOGC     The profile logic expressions.
PCSCPASS     The system communication file containing data
             passed from INPUTR and used to pass data to
             OCP and STIXA.
PCSCCITS     The citations retrieved (each citation that was
             a hit is included once only, regardless of how
             many profiles found it).
PCSCHITS     The hits found: one record for each citation
             found for each profile.
PCSCSTFD     The data needed to produce the Search Term
             Frequency/Issue listing.
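
The grouping of terms by Least Common Bigram, and the way such a grouping can reduce the work done per citation, can be sketched as follows. This is a reading of the technique from its name and from the file descriptions above, not a transcription of INPUTR or SEARCH: the bigram frequencies are invented, the function names are hypothetical, and the tie-breaking rule is an arbitrary choice made so the example is deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Set


def bigrams(text: str) -> Set[str]:
    """All adjacent two-character sequences in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def least_common_bigram(term: str, bigram_freq: Dict[str, int]) -> str:
    """Pick the bigram of the term that is rarest in the data base.

    Bigrams absent from the table count as frequency 0; ties are
    broken alphabetically (an assumption of this sketch).
    """
    if len(term) < 2:
        raise ValueError("term must contain at least two characters")
    return min(sorted(bigrams(term)), key=lambda b: bigram_freq.get(b, 0))


def group_terms_by_lcb(terms: List[str], bigram_freq: Dict[str, int]) -> Dict[str, List[str]]:
    """Aggregate unique terms and group them under their LCB."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for term in sorted(set(terms)):
        groups[least_common_bigram(term, bigram_freq)].append(term)
    return dict(groups)


def terms_found(citation_text: str, groups: Dict[str, List[str]]) -> Set[str]:
    """Compare a citation only against terms whose LCB occurs in it."""
    present = bigrams(citation_text)
    found: Set[str] = set()
    for lcb, terms in groups.items():
        if lcb in present:
            found.update(t for t in terms if t in citation_text)
    return found


# Hypothetical bigram frequencies for some data base.
freq = {"IN": 900, "NF": 50, "FR": 120, "RA": 400, "AR": 300, "RE": 800,
        "ED": 600, "AN": 700, "NI": 200, "IO": 350, "ON": 650}
groups = group_terms_by_lcb(["INFRARED", "ANION", "INFRARED"], freq)
found = terms_found("VIBRONIC EFFECTS IN THE INFRARED SPECTRUM", groups)
print(groups, found)
```

In this example the bigram NI occurs in the citation (inside VIBRONIC), so ANION is compared and rejected; a term whose LCB is absent from the citation is never compared at all, which is where the saving would come from.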



5.3.8 DBCARD 

DBCARD (Data Base CARD formatting program) reads the 
file of citations retrieved and builds, for each citation, a 
card image. The format of the card is determined, within 
limits, by the sizes of the various fields of the citation. 
File Name    Use in DBCARD

PCSCCITS     The unique citations file produced by SEARCH.
PCSCFMCT     The output card images for the citations that
             were hits.



5.3.9 DB-OCP (Data Base Output Control Program)

5.3.9.1 OCP-OCP1

OCP-OCP1 (Output Control Program, Step 1) makes multiple
copies of the citation card images, one for each hit
recorded for each citation.

File Name    Use in OCP-OCP1

PCSCHITS     The file of hits written by SEARCH.
PCSCFMCT     The citation card images written by DBCARD.
POUTPRNT     The expanded hit-citation file; each record is
             a complete description of a hit, including
             the citation card image.

5.3.9.2 OCP-SORT1

OCP-SORT1 sorts the profile information blocks into
profile number order.

File Name    Use in OCP-SORT1

PCSCHEAD     The profile information blocks written by INPUTR.
PCSCSTHD     The sorted profile information blocks.
5.3.9.3 OCP-SORT2

OCP-SORT2 sorts the records in the expanded hit-citation
file, written in OCP-OCP1, into citation number order
within each profile. It also applies any special sorts
requested for output.

File Name    Use in OCP-SORT2

POUTPRNT     The expanded hit-citation file written by
             OCP-OCP1.
PCSCSRTD     The sorted expanded hit-citation file.
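
The expansion performed by OCP-OCP1 and the ordering applied by OCP-SORT2 amount to a join followed by a two-key sort. The following is a minimal in-memory sketch, assuming simplified record shapes: a hit is a (profile number, citation number, terms found) tuple and a card image is a string keyed by citation number. These shapes and the function names are hypothetical, and the real programs work on tape and disk files rather than Python lists.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int, List[str]]  # (profile number, citation number, terms found)


def expand_hits(hits: List[Hit], card_images: Dict[int, str]) -> List[dict]:
    """One output record per hit, each carrying its own copy of the
    citation card image (the OCP-OCP1 step)."""
    return [
        {"profile": profile, "citation": citation, "terms": terms,
         "card": card_images[citation]}
        for profile, citation, terms in hits
    ]


def sort_for_output(expanded: List[dict]) -> List[dict]:
    """Citation number order within each profile (the OCP-SORT2 step,
    without any special sorts)."""
    return sorted(expanded, key=lambda rec: (rec["profile"], rec["citation"]))


# Citation 81368 was a hit for two profiles, so its card is copied twice.
hits = [(7, 81368, ["INFRARED"]), (3, 81400, ["ANION"]), (3, 81368, ["ANION"])]
cards = {81368: "card image for 81368", 81400: "card image for 81400"}
ordered = sort_for_output(expand_hits(hits, cards))
print([(rec["profile"], rec["citation"]) for rec in ordered])
```

Formatting each unique citation once in DBCARD and copying the result per hit avoids repeating the card layout work for citations retrieved by several profiles.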

5.3.9.4 OCP-OCP2

OCP-OCP2 (Output Control Program, Step 2) reads the hit-citation
file; it inserts into citations the weights and search
terms found, creates header and trailer cards, applies print
limits, builds Search Term Frequency/Issue cards, and writes
the file of card images that produces the output cards and a
compressed listing, without blank lines, for COM output.

File Name    Use in OCP-OCP2

PCSCSTFD     The data used to create the Search Term
             Occurrences listings.
PCSCSRTD     The (sorted) expanded hit-citation list.
PCSCSTHD     The (sorted) profile information blocks.
PCSCPASS     The system communication file.
PCSCOPLG     The printout counts for each profile, used by
             STIXA.
PCSCEXTR     A file of card images for tape output.
POUTPRNT     The print file of card images. This file is
             printed on 5" x 8" cards by an IBM utility
             program or off-line printing unit.
PCSCPRNT     A file used to hold header and trailer card
             images temporarily.
DBOCPPRT     The COM listing. The OCP listing is placed
             after the FORCON listing on this tape.



5.3.9.5 OCP-MICRO

OCP-MICRO translates the FORCON-OCP listing into a
form suitable for use with the specific COM unit used to
produce a microfilm copy of the listings.

File Name       Use in OCP-MICRO

DBOCPPRT        The FORCON-OCP listing.
MICRO.OUTPUT    The microfilm-format tape.



5.3.10 HITTER 

HITTER generates a list showing, for each citation 
found by each profile, the search terms found in the citation. 



File Name    Use in HITTER

PCSCHITS     The hit list file written by SEARCH.
PCSCSTHT     The hit list file, sorted on profile number.



5.3.11 STIXA 



STIXA produces a statistical summary for each issue's
run. Included are breakdowns of hits and prints by profile
as well as the sizes of various files, as provided by the
creating program. (See Production Statistics, Section 10.4 of
this report.)

File Name    Use by STIXA

PCSCPASS     The system communication file; contains various
             statistics supplied by individual programs.
PCSCOPLG     A file of hit and print counts per profile.

5.3.12 PLSXT

PLSXT (Private Libraries System Extraction program)
extracts the output from profiles for which Private Libraries
System (PLS) files are to be created, converts the output to
PLS format, and merges the result into the PLS Master File. A
copy of the previous Master File is made, to function as a
back-up to the updated Master File.

File Name    Use in PLSXT

PCSCHITS     The hit file.
PCSCCITS     The unique citations retrieved file.
PCSCHEAD     The profile information block file.
PRIMASTR     The updated PLS Master File.
PRIWKOUT     The new citations that were added to the PLS
             Master File.
PRISRTOT     The old PLS Master File (created as an
             emergency back-up file).
5.4 Core Storage Requirements

All of the Computer Search Center programs can be run with
a minimal configuration containing 250K bytes of core storage,
two tape drives and one disk. However, modifications of the
files can change these requirements to some extent. Our current
sizes are designed for compute-bound operation on an IBM 360/65
computer, using 300K bytes of core for the largest (SEARCH)
program. Core requirements can be decreased with a corresponding
increase in I/O time for a smaller computer. If the smaller
computer were also slower (e.g., 360/50) more concurrent time
would be available for I/O, resulting in no overall decrease in
efficiency. The program requirements in the current operating
environment are:

CSC Programs

DKEDIT       <95K bytes
             1 sequential file (disk)

FORCON       <110K bytes
             2 sequential files (tape)

INPUTR       (180K) + (13 x no. of profiles) +
             (31 x no. of terms) bytes
             1 sequential file (tape)
             7 sequential files (disk)
             1 direct-access file (disk)
             9 IBM SORT/MERGE files (disk)

SEARCH       (150K) + (547 x no. of profiles) +
             (30 x no. of terms) + ((no. of profiles + 1) x
             ([no. of terms/8] + 1)) bytes (note: using number
             of unique terms)
             1 direct-access file (disk)
             1 sequential file (tape)
             4 sequential files (disk)

DBCARD       <110K bytes
             2 sequential files (tape)

STIXA        <85K bytes
             1 direct-access file (disk)
             1 sequential file (disk)

OCP-OCP1     <80K bytes
             1 sequential file (disk)
             2 sequential files (tape)

OCP-SORT1    128K bytes
             2 sequential files (disk)
             4 SORT/MERGE work files (disk)

OCP-SORT2    200K bytes
             2 sequential files (tape)
             16 SORT/MERGE work files (disk)

OCP-OCP2     <115K bytes
             1 direct-access file (disk)
             2 sequential files (disk)
             5 sequential files (tape or disk)

OCP-MICRO    <55K bytes
             2 sequential files (tape)

PLSXT        <150K bytes
             2 sequential files (disk)
             5 sequential files (tape or disk)
             6 SORT/MERGE work files (disk)

IFCOPY       <65K bytes
             2 sequential files (tape)

MINIPUP      <85K bytes
             4 sequential files (tape or disk)

DBCOPY       <65K bytes
             2 sequential files (disk)

IBM Utilities

IEBPTPCH     60K bytes
             1 sequential file (tape)
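
The two size formulas given for INPUTR and SEARCH can be evaluated for a planned run as in the sketch below. Several readings are assumptions of this example rather than statements from the report: K is taken as 1024 bytes, the operators between the terms of the SEARCH formula are read as additions, and the bracketed quotient is read as an integer division. The function names are hypothetical.

```python
K = 1024  # assumption: "K" in the table means 1024 bytes


def inputr_core_bytes(profiles: int, terms: int) -> int:
    """(180K) + (13 x no. of profiles) + (31 x no. of terms)."""
    return 180 * K + 13 * profiles + 31 * terms


def search_core_bytes(profiles: int, unique_terms: int) -> int:
    """(150K) + (547 x profiles) + (30 x terms)
    + ((profiles + 1) x ([terms/8] + 1)), using unique terms."""
    bytes_per_row = unique_terms // 8 + 1  # bracket read as integer division
    return (150 * K + 547 * profiles + 30 * unique_terms
            + (profiles + 1) * bytes_per_row)


# Hypothetical run: 100 profiles containing 1000 unique terms.
print(inputr_core_bytes(100, 1000))
print(search_core_bytes(100, 1000))
```

The last term of the SEARCH formula grows with the product of profiles and terms divided by eight, which is consistent with one bit per term for each profile, although the report does not say so at this point.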



5.5 System Files and File Structures

The various programs that constitute the SDI system communicate
via a system of files. These files are used to pass
blocks of data created at each step to the step or steps that
use them later or to permanently hold data for the archives. The
data in each file have a carefully defined structure that is
used in reading or writing the file. In the following file
descriptions we have defined the data structures in terms of the
PL/1 data declarations used to read or write the file. For
details of the resulting physical data arrangement, see the IBM
Systems Reference Library publication GC28-8201, PL/1 (F)
Language Reference Manual.

5.5.1 CA-COND.SDF1 (This is an example of a data base tape.)

Description:

This tape is the CA Condensates data base tape in
Standard Distribution Format (SDF) as supplied by
CAS. It typically contains 5000-8000 citations.

Format:

For a description of Standard Distribution Format
see the following CAS publications:
Standard Distribution Format, Technical Specifications
(Revised); Condensates in S.D.F., Data
Content Specifications

History:

Creation:     Supplied by CAS
Referenced:   Read by CACOPY
Disposition:  Scratched after copying

5.5.2 DBOCPPRT (Data Base OCP Print tape)

Description:

This tape contains the FORCON (see Section 5.3.2 of
this report) and OCP (see Section 5.3.9 of this report)
listings and is used to create a microfilm copy
of those listings.

Format:

Block of 50 records, each 121 characters long.
Each record is a line image written to a PL/1
PRINT file.

History:

Creation:     FORCON
Update:       Read by MICRO (the OCP microfilm step)
Disposition:  Held one week, then re-used in same
              capacity

5.5.3 UFDBvvii (UnFormatted Data Base copy, volume vv,
issue ii)

Description:

This tape is a copy of the data base tape in supplier
format for volume vv, issue ii.

Format:

This tape is in the supplier's format. For details
see the documentation for the specific data base.
Number of issues per copy reel will vary with data
base.

History:

Creation:     Written by DBCOPY
Referenced:   Read by FORCON
Disposition:  Kept permanently

5.5.4 IFDBvvii (IITRI Format Data Base tape, volume vv,
issue ii)

Description:

This tape is an IITRI-format equivalent of UFDBvvii,
where vv = volume number, ii = issue number. This is
the SDI search tape.

Format:

Varying length records up to 4000 characters long in
blocks up to 4000 characters long. For a complete
description of IITRI format, see Section 5.2 of this
report.

History:

Creation:     Written by FORCON (six issues per
              reel)
Referenced:   Read by IFCOPY
              Read by SEARCH
Disposition:  Kept one year

5.5.5 IFRFDBvv (IITRI Format Retro File Data Base, volume vv)

Description:

This tape is a retrospective data base containing IITRI-formatted
tapes for a given volume. Typically a volume
takes slightly more than two reels, though this
varies according to data base.

Format:

Varying length records up to 4000 characters long in
blocks up to 4000 characters long. For a complete
description of IITRI format, see Section 5.2 of this
report.

History:

Creation:     Written by IFCOPY (DISP=MOD)
Disposition:  Kept permanently

5.5.6 M. CRO. OUTPUT (Mi cro film Output) 

Description : 

This is a tape formatted for input to a COM system for 
production of microfilm output. It contains the FORCON 
and OCP listings for an issue's run. 

Format : 

This is an 800-bpi tape. The format is specific to 
the COM unit used. 

History : 

Creation: Written by OCP2

Disposition: Re-used for next issue after production
             of microfilm.

5.5.7 PCSCBKUP (PCSCCARD Backup) 

Description : 

This is an intermediate file used in MINIPUP (see
Section 5.3.5 of this report). It contains the
results of the merge, but not the run header card.

It provides a back-up to the output tape which can be
retrieved by MINIPUP using a control card.

Format :

Blocks contain 80 fixed-length 80-character records. 

The order and delimiters are the same as for PCSCCARD 
except there is no run header card. 

History : 

Creation: Written by MINIPUP 

Referenced: Read by MINIPUP 

Disposition: Saved until next issue, then re-used 

5.5.8 PCSCCARD (Profile Card stream)

Description :

This tape contains the profile stream for input to
INPUTR.

Format :

Blocks contain 80 fixed-length 80-character records.
Each is the image of a punched card.

The deck consists of:

(1) Run header card = 'PPPTTTTT' where PPP =
    number of profiles, TTTTT = number of terms

(2) Header and term cards, ordered on
    positions 3-9

(3) Delimiter card = 'DELIMITER', blanks, then
    'LAST TERM CARD'

(4) Logic expression cards, ordered on positions
    [illegible]

(5) Delimiter card = 'DELIMITER', blanks, then
    'LAST LOGIC CARD'

For details of header, term, and logic formats, see
Sections 5.5.11, 5.5.13, and 5.5.19 of this report.
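As an illustration only, reading the two counts from a run header card image might look like the following Python sketch. It assumes the 'PPPTTTTT' field starts in card column 1, which the description above does not state; the function name is hypothetical and is not part of the CSC system.

```python
def parse_run_header(card: str) -> tuple[int, int]:
    """Split a run header card image into (profiles, terms).

    Assumption: the eight digits 'PPPTTTTT' occupy columns 1-8 of the
    80-column card image; the report does not give the columns.
    """
    field = card[:8]
    if len(field) < 8 or not field.isdigit():
        raise ValueError(f"not a run header card: {card!r}")
    # PPP = number of profiles, TTTTT = number of terms
    return int(field[:3]), int(field[3:8])


# A card announcing 200 profiles and 3000 terms, padded to 80 columns.
card = "20003000".ljust(80)
profiles, terms = parse_run_header(card)
print(profiles, terms)
```

A reader such as this would let a later step size its tables before the header and term cards are processed.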



History :

Creation: Written by IEBGENER from cards

Update: Read and rewritten by MINIPUP

Referenced: Read by INPUTR

Disposition: Updated and used for each issue (in
             the case of CA Condensates, every
             other issue; we use separate tapes
             for CA even and CA odd issues)

5.5.9 PCSCCITS (Unique Citations retrieved file)

Description : 

This file contains one copy of each citation retrieved 
by SEARCH. 

Format : 

Blocks of two fixed-length 1251-character records.

Records are written and read with the PL/1 structure:

1 CITATION_RECORD,   /* contains one citation from the data */
                     /* base, in IITRI-format               */
  2 ABSTRACT_NUMBER character (11),
                     /* the unique abstract no., 11 digits  */
  2 DIRECTORY (60) fixed binary (31),
                     /* the IITRI-format directory;         */
                     /* contains sixty fullword binary      */
                     /* numbers                             */
  2 STRING character (1000),
                     /* the IITRI-format string part        */
                     /* a fixed-length version for better   */
                     /* processing efficiency               */

History :

Creation: Written by SEARCH

Referenced: Read by CACARD
            Read by OCP1
            Read by PRILIB

Disposition: Re-used for each issue 

5.5.10 PCSCFMCT (Formatted Citation file) 

Description : 

This file consists of images of 5" x 8" printout cards. 

These are generated by DBCARD from the citations in 
PCSCCITS . 

Format : 

Blocks contain two fixed-length 2411-character records.
Each record consists of an 11-character citation number
and 30 80-character line images.

History :

Creation: Written by CACARD

Referenced: Read by OCP1

Disposition: Re-used for each issue

5.5.11 PCSCHEAD (Header information file)

Description : 

This file contains profile information extracted in 
INPUTR from the profile header card and the profile 
itself. 

Format : 

Blocks contain 100 fixed-length 26-character records. 

Records are read and written with the PL/1 structure: 

1 HEADER,            /* this is a HEADER information block  */
  2 PROFILE_NUMBER character (10),
                     /* the ten-character user i.d. number  */
  2 WEIGHT_THRESHOLD fixed decimal (5),
                     /* the minimum weight required for a   */
                     /* citation to be retrieved            */
  2 OUTPUT_DEFINITION character (4),
                     /* position 1: output medium (C or P)  */
                     /* position 2: number of copies        */
                     /* positions 3-4: sort type for output */
  2 PRINT_LIMIT fixed decimal (5),
                     /* maximum number of citations to be   */
                     /* printed                             */
  2 SORT_FIELD_LENGTH fixed decimal (5),
                     /* length of the field selected for    */
                     /* the output sort                     */
  2 EXTRACTION_REQUEST character (3),
                     /* contains 'PRI' if the profile's     */
                     /* output is placed in a private       */
                     /* library                             */

History :

Creation: Written by INPUTR

Referenced: Read by SEARCH
            Read by OCP-SORT1
            Read by PRILIB

5.5.12 PCSCHITS (Hits Recorded)

Description : 

This file contains one record for each citation found for 
each profile. It is used to construct the output stream 
from the file of unique citations. 

Format : 

Blocks contain 20 fixed-length 148-character records.
Records are read and written with the PL/1 structure:

1 HIT_RECORD,        /* describes one hit (i.e. a single     */
                     /* citation matching a single profile)  */
  2 PROFILE_NUMBER character (10),
                     /* the ten-letter user i.d. number      */
  2 HIT_WEIGHT fixed decimal (5),
                     /* the retrieval weight of the citation */
                     /* for this profile                     */
  2 CITATION_NUMBER character (11),
                     /* the abstract number of this citation */
  2 SORT_FIELD character (45),
                     /* the actual string that will be used  */
                     /* as a sort key in ordering the output */
  2 SEARCH_TERMS character (79),
                     /* a string containing the search terms */
                     /* that were present in the citation    */
                     /* from this profile                    */

History :

Creation: Written by SEARCH

Referenced: Read by OCP1
            Read by HITTER
            Read by PRILIB

Disposition: Re-used for each issue
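The stated record lengths can be cross-checked against the PL/1 field declarations. The Python sketch below does this for three of the structures above. It rests on assumptions that the report does not spell out: that the structures are stored without alignment padding, that FIXED DECIMAL (5) is held as 3 bytes of packed decimal, and that FIXED BINARY (31) is a 4-byte fullword (the report does call these "fullword binary numbers"). The helper names are illustrative.

```python
def field_bytes(kind: str, size: int, count: int = 1) -> int:
    """Storage for one declared field, assuming no alignment padding."""
    if kind == "char":
        n = size                    # one byte per character
    elif kind == "dec":
        n = (size + 2) // 2         # packed decimal: digits plus a sign nibble
    elif kind == "bin":
        n = 4 if size > 15 else 2   # fullword or halfword binary
    else:
        raise ValueError(kind)
    return n * count


def record_length(fields) -> int:
    return sum(field_bytes(*f) for f in fields)


# (field list, record length stated in the report)
LAYOUTS = {
    "PCSCCITS": ([("char", 11), ("bin", 31, 60), ("char", 1000)], 1251),
    "PCSCHEAD": ([("char", 10), ("dec", 5), ("char", 4),
                  ("dec", 5), ("dec", 5), ("char", 3)], 26),
    "PCSCHITS": ([("char", 10), ("dec", 5), ("char", 11),
                  ("char", 45), ("char", 79)], 148),
}

for name, (fields, stated) in LAYOUTS.items():
    print(name, record_length(fields), stated)
```

Under these assumptions the computed lengths agree with the stated ones, which also suggests that the structures were declared unaligned.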
5.5.13 PCSCLOGC (Profile Logic Expressions)

Description :

This file contains the logic expressions for the search
profiles. The expressions consist of terms (represented
by term numbers in the Aggregated Term List) and
operators, and are in Early Operator Reverse Polish (EORP)
form.

Format :

Blocks contain a single varying-length record with a
maximum length of 4630 characters. The PL/1 structure
used for these is:

1 LOGIC_EXPRESSION,  /* the internal representation of a     */
                     /* profile's logic expression           */
  2 THRESHOLD_TERMS fixed binary (31),
                     /* the minimum number of terms which    */
                     /* must be found for the logic to be    */
                     /* satisfied                            */
  2 BIT_ARRAY (EXPTOT) bit (1),
                     /* a vector containing one bit for      */
                     /* each term in the A.T.L. The size     */
                     /* of the vector (EXPTOT) is read at    */
                     /* initialization time                  */
  2 PROFILE_NUMBER character (10),
                     /* the ten-letter user i.d. number      */
  2 SORT_OPTION character (2),
                     /* the output sort type; WT for a       */
                     /* sort by weight, AN for abstract      */
                     /* number order, 03 for author order    */
  2 THRESHOLD_WEIGHT character (3),
                     /* the minimum weight necessary for     */
                     /* retrieval, a zoned-decimal number    */
  2 NUMBER_OF_SYMBOLS character (3),
                     /* the total number of terms and        */
                     /* operators in the expanded logic      */
                     /* expression, a zoned-decimal number   */
  2 EXPRESSION character (4200) varying,
                     /* the actual terms and operators, in   */
                     /* seven-letter blocks containing the   */
                     /* term number (in the A.T.L.) and      */
                     /* weight for terms, or operator code   */
                     /* for Boolean operators                */



History : 

Creation: Written by INPUTR 

Referenced: Read by SEARCH 

Disposition: Re-used for each issue 

5.5.14 PCSCOPLG (Output Logging File)

Description : 

This file consists of accounting information for STIXA. 

One record is written for each profile to generate the 
breakdown of hits and prints by profile. 

Format : 

Blocks contain 200 fixed-length 16-character records. 

Records are read and written with the PL/1 structure: 

1 OUTPUT_LOG_RECORD,
                     /* this record contains the number of   */
                     /* hits recorded for a given profile    */
  2 PROFILE_NUMBER character (10),
                     /* the ten-letter user i.d. number      */
  2 PRINT_LIMIT fixed decimal (5),
                     /* the maximum number of citations to   */
                     /* be printed by this profile           */
  2 NUMBER_OF_HITS fixed decimal (5),
                     /* the number of citations retrieved    */
                     /* by this profile                      */

History :

Creation: Written by OCP2

Referenced: Read by STIXA 

Disposition: Re-used for each issue 

5.5.15 PCSCPASS (Run Information Passing File)

Description : 

This is the system communication file. It contains six 
values which are set and read at various points in the 
system. 

Format : 



This is a direct-access file with REGIONAL (1) organization.
Each record is a fullword binary number. The
entries are:

(1) Number of citations searched 

(2) Number of profiles 

(3) Number of terms 

(4) Number of unique terms 

(5) Number of unique citations retrieved 

(6) Volume and issue (vvii, a 4-digit number) 



History : 

Creation: Permanently set-up 

Updated: Written from INPUTR
         Written from SEARCH
         Written from OCP2

Disposition: Re-used for each issue 

5.5.16 PCSCSRTD (Sorted expanded hit-citation file)

Description :

This file is a sorted version of POUTPRNT-1 (q.v.), the
expanded hit-citation file. The sort is on profile
number and output sort field, as selected by user.

Format :

Blocks contain 2 fixed-length 2548-character records.
For detailed record format see: POUTPRNT-1.






History :

Creation: Written by OCP-SORT2

Referenced: Read by OCP2

Disposition: Re-used for each issue 

5.5.17 PCSCSTHD (Sorted Header Information file) 

Description : 

This file is the same as PCSCHEAD (q.v.), but sorted 
on profile number. 

Format : 

Blocks contain 100 fixed-length 26-character records.

For record format see: PCSCHEAD.

History :

Creation: Written by OCP-SORT1

Referenced: Read by OCP2

Disposition: Temporary disk file, released at end
             of run

5.5.18 PCSCSTHT (Sorted Hit file)

Description : 

This file is a copy of PCSCHITS (q.v.), sorted on pro- 
file number. It is used by HITTER to create its listing 
of search terms found in citations retrieved. 

Format : 

Blocks contain 30 fixed-length, 148-character records.
For record format, see: PCSCHITS.

History : 

Creation: Written by sort in HITTER 

Referenced: Read by HITTER 

Disposition: Temporary disk file, released at end 

of run 

5.5.19 PCSCSTFD (Search Term Frequency Data file) 

Description : 

This file contains the counts of the number of citations 
containing each search term in each profile. This data 
is used to generate the Search Term Occurrences cards in 
the output stream.

Format :

Blocks of 200 fixed-length, 13-character records.

Records are read and written with the PL/1 structure:

1 STF_DATA,          /* an item in the Search Term Frequency */
  2 PROFILE_NUMBER character (10),
                     /* the ten-letter user i.d. number      */
  2 LINK_NAME character (1),
                     /* the letter used to designate the     */
                     /* link in which this term appeared     */
                     /* if unlinked, then contains           */
  2 TERM character (22),
                     /* the search term                      */
                     /* asterisks are added at left and/or   */
                     /* right ends to show truncation mode   */
  2 COUNT fixed binary (31),
                     /* a fullword binary number giving the  */
                     /* number of citations in which this    */
                     /* term was found                       */

History : 

Creation: Written by SEARCH 

Referenced: Read by OCP2

Disposition: Re-used for each issue

5.5.20 PCSCTERM (Aggregated Term List)

Description :

This file contains the aggregated term list and the
information needed to match the terms against the
citation search string.

Format :

The file consists of blocks of 20 27-character records, 
each of which consists of the search term, with the LCB 
prefixed to it, the offset of the LCB from the beginning 
of the term (so that the term can be 'slid' into the 
correct orientation on the citation string) , and the 
truncation mode, expressed as a single character. 

History : 

Creation: Written by INPUTR 

Referenced: Read by SEARCH 

Disposition: Re-used for each issue 



5.5.21 POUTPRNT-1 (Note: this name is used for two
different files)

Description :

This file is an expanded hit-citation list, with the
citation card image appended to each hit record.

Format :

Blocks contain 2 fixed-length 2548-character records.
Record format is the same as for PCSCHITS except that
a new element '2 CARD_IMAGE character (2400)' is added
to the structure to contain the 30 80-character line
images.

History :

Creation: Written by OCP1

Referenced: Read by OCP-SORT2

Disposition: Tape is re-used immediately by OCP2

5.5.22 POUTPRNT-2 (Note: this name is used for two
different files.)

Description :

This is the print tape and consists of line images to
be printed by an IBM utility or off-line printer whenever
convenient.

Format :

Blocks contain 60 fixed-length 80-character records.

Each record is an image of a line of output to be printed.

History :

Creation: Written by OCP2

Referenced: Read by printing routine

Disposition: Re-used for each issue

5.5.23 PRIMASTR (Private Libraries Master Tape)
       PRISRTOT (Private Libraries Restart Old Tape)
       PRIWKOUT (Private Libraries Update Work Output Tape)

Description :

These three tapes are the Private Libraries Master tape,
a back-up copy containing the previous Master, and an
update tape containing the records that were added to
the previous Master to produce the current Master.

Format :

These tapes contain variable-length records up to
4000 characters long in blocks up to 4000 characters
long. For detailed format information, see the Private
Libraries System section of this report.

History :

Creation: Written by PLSXT

Update, Referenced: By Private Libraries System (q.v.) as
                    necessary

Disposition: Updated after each SDI system run

5.6 Search Algorithm Comparison

5.6.1 Introduction



A bibliographic search program must perform several
functions. The first of these is matching search terms to
citations. Since the CSC Software System is essentially an SDI
search system, there are several constraints on this matching
process. For SDI, one assumes a one-time use of the data
base, so extensive reformatting is not cost-effective.
Also, we would expect to search many profiles against each
citation without re-reading the citations for each profile.
Finally, for data bases with uncontrolled vocabulary (the
class we are considering) it is necessary to: 1) check the
types of information in the citation to limit search for
specific data elements to the appropriate portion of a
citation, e.g., to find author terms the search should be
restricted to the author portion of a citation, and 2)
search on word fragments, words, multiword terms, and phrases.

If the citation can be divided into terms defined by 
readily recognizable delimiters, for example, words bounded 
by spaces, a number of variant search orders are possible. 

In one variant, each term in each profile is matched against
each term in each bibliographic record. In a second variant,
the profile terms can be inverted and each term of a
bibliographic record is sequentially matched against all terms in
the inverted profile term list. The profile term list is
passed as many times as there are terms in the bibliographic
records.

In a third variant, profiles are inverted and each
profile term is matched against each bibliographic record.
The bibliographic record file is passed as many times as there
are terms in the inverted profile term list. A fourth variant
is based upon inverting the bibliographic records. This
inverted file can be matched against profile terms, either
profile by profile or in an inverted profile list.

Each of the above variant search procedures assumes 
that designated delimiters effectively distinguish terms. As 
we will discuss below, this is not wholly true in a complex, 
scientific data base. Term designation precludes phrase 
searching unless individual terms can be concatenated by means 
of a logic operator. An inverted bibliographic term list
precludes left truncation.

The major problem with all of these methods that de- 
pend upon division of a citation into words is that of defin- 
ing what is to be considered a word. The obvious supposition 
is that a word is any group of letters bounded by blanks. 

The other extreme is that all nonalphanumeric characters be 
considered delimiters. Neither of these options is an 
acceptable solution because in the first case, going from 
blank to blank makes all punctuation part of the preceding 
word (which is sometimes the case, in fact) and in the second, 
the fact that special characters can be parts of words is 
totally ignored. Implementing either of these alternatives 
would make effective profile preparation improbable since in
the first case all words would have to be listed as truncated
on the right to allow for punctuation problems (not to mention
the problems arising with left parentheses) and in the second
case, even simple abbreviations or hyphenated words would 
have to be entered into the system by individual letters and 
fragments. Although a compromise between the two solutions 
could be programmed, the resulting programs would not be 
general and would be open to continuous revision inasmuch as 
the input is essentially uncontrolled. 

It then becomes apparent that in our SDI system for
searching data bases with uncontrolled vocabulary, it is
necessary to avoid arbitrary definitions of a term such as
"characters delimited by blanks". Instead, we must define a
search term as an arbitrary string of characters to be matched
against any given portion or portions of a citation. Thus
the citation cannot be divided into terms but is treated as
a string of characters and cannot be inverted. The functions
of a search program then can be listed as:

• match search terms against the citation string

• check term types

• maintain assignment of potential hit terms to
  respective profiles in the batch

• evaluate profile logic expressions

• prepare hit records for subsequent processing

In this section we deal primarily with the first of 
these functions, the basic problem of matching search terms 
to citations and coming up with a "yes" or "no" answer for 
the presence of any given search term in any citation. 

However, the method in which this basic task is performed 
will affect some of the other search functions, and we will 
discuss them as they occur. We are assuming that, for reasons 
of one-time use of the data base and presence of uncontrolled 
vocabulary, the data base is to be searched serially, one 
citation at a time, for all the profiles in the system. To 
allow comparisons among the various methodologies that we 
discuss below, we will assume that there are 3000 terms in 
the search term list (200 profiles averaging 15 search terms 
each), 5000 citations on the data base issue being searched, 
and a citation length of 200 characters. These assumptions 
represent a typical SDI run for a data base such as EI
COMPENDEX or CA Condensates.

In the following sections we discuss the various 
algorithms we have used to develop an efficient method of 
matching search terms against citations. The algorithms are 
presented in order of increasing efficiency. This is also the
chronological order in which the various methodologies were
used by the Computer Search Center.



5.6.2 Term-to-Citation Algorithm



The simplest method is to match each term in the 
search term list against each citation, letter by letter. 

The search terms are, of course, sequenced by type, so that 
author terms are matched only against the author portion of 
the citation, etc. In this method, when working on the title 
of a citation, all title terms in the search term list are 
checked against the entire title. In effect, this means that 
the program must take the first title search term and check 
if it matches the title, beginning at the first letter of 
the title. If not, the search term is "slid over" one
character in the title and checked again. This is repeated
until the last character of the title is reached. Then the
whole process is repeated for the next title-type search term,
etc.

This is a very simple and straightforward method. It
can be coded very easily. In fact, in PL/1 there is a
built-in function (INDEX) that permits a one-line coding of
this method. However, it is extremely expensive. For our
assumed SDI parameters, there would be 3000 times 5000 times
200, or 3 billion matches required. Although tested for
information purposes, this highly inefficient algorithm was
never used in production.
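For illustration, the term-to-citation method can be sketched in a few lines of Python, with `str.find` standing in for the PL/1 INDEX built-in. This is a sketch, not the code that was tested at the Center; the function name and the sample data are invented, and term types and truncation are ignored.

```python
def term_to_citation(terms, citations):
    """Check every term against every citation, term by term."""
    hits = []
    for term_no, term in enumerate(terms):
        for cit_no, text in enumerate(citations):
            # str.find slides the term along the text one character
            # at a time, as INDEX does in PL/1.
            if text.find(term) >= 0:
                hits.append((term_no, cit_no))
    return hits


terms = ["ALDEHYDE", "KETONE"]
citations = [
    "PRESENCE OF ALDEHYDES IN WINE",
    "OXIDATION OF KETONES",
    "ALDEHYDE AND KETONE",
]
print(term_to_citation(terms, citations))

# Cost estimate for the assumed SDI run: every term is tried at every
# character position of every citation.
TERMS, CITATIONS, CITATION_LENGTH = 3000, 5000, 200
print(TERMS * CITATIONS * CITATION_LENGTH)
```

The final product is the 3 billion figure quoted above.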

5.6.3 Basic Citation-to-Term Algorithm

A preliminary analysis of the problem indicates that the
search should be done from citations to search terms rather
than the reverse, since there are only some 200 characters in
the citation and 3000 initial characters in the search term
list. (Throughout this entire discussion, we are ignoring
match on the rest of the characters of a search term after
the entry character to the citation string has been found.
This additional character matching is the same for all
methodologies and so can be eliminated from the discussion.)

In this basic method, the program goes through the 
citation one character at a time and checks for match only 
those search terms that begin with the character then being 
considered. (The previous division by term type is assumed 
here also, as it is in the balance of the discussion.) The 
reason that this methodology is more efficient than the 
previous one is based on the fact that the search term list 
is divided into groups separated by initial letters. When 
an "A" is found in the citation, only those terms beginning 
with "A" are checked for match, rather than all the terms. 

Since we are working with a character set of 50 characters
(alphabetics, numerics, and punctuation symbols), in the ideal
case our 3000 search term list would be divided into 50 groups
of 60 terms each. The number of matches would then be 60 times
5000 times 200 or 60 million. However, there are certain
characters that are seldom found at the beginning of a word
(very few words begin with a semi-colon, for example), so
in reality the average size of a group is 100 terms rather
than 60, and 100 million matches are required.

This methodology reduces the number of matches to 3.33% 
of those required by the first algorithm. It cannot be coded 
quite as simply, since tables must be maintained to point to 
the position in the search term list of each of the groups 
delimited by a different character and to indicate the number 
of search terms in each of the groups. However, building 
these tables is quite simple and using them does not add much 
to the cost of matching. Truncation is as easily checked as
in the first case by looking at the character preceding and
the character following the term after a match has been
found. We used this algorithm as our first production
methodology.
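The citation-to-term idea can be sketched as follows. This is an assumption-laden illustration: a Python dictionary keyed on the initial character replaces the position and count tables described above, term types and truncation checks are left out, and all names and sample data are invented.

```python
from collections import defaultdict


def build_letter_groups(terms):
    """Group the search terms by their initial character."""
    groups = defaultdict(list)
    for term in terms:
        groups[term[0]].append(term)
    return groups


def citation_to_term(text, groups):
    """Walk the citation once; try only terms starting with each character."""
    found, attempts = set(), 0
    for pos, ch in enumerate(text):
        for term in groups.get(ch, ()):
            attempts += 1                     # one match attempt
            if text.startswith(term, pos):
                found.add(term)
    return found, attempts


terms = ["ALDEHYDE", "ACID", "KETONE", "WINE"]
text = "PRESENCE OF ALDEHYDES IN WINE"
found, attempts = citation_to_term(text, build_letter_groups(terms))
print(sorted(found), attempts)

# Estimates for the assumed run (5000 citations of 200 characters).
print(60 * 5000 * 200, 100 * 5000 * 200)
```

On this small sample only three match attempts are made, against 4 terms times 29 positions for the term-to-citation method.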
5.6.4 Basic Citation-to-Term via Initial Bigrams



Since the search should proceed from citations to search 
terms, our aim was to reduce the size of an average group of 
search terms and increase the number of groups. This would 
mean fewer matches for each locator in the citation. However, 
it must be accomplished without overly complicating the process 
of getting from the citation to the proper place in the term 
list. In the case above, this was accomplished by an alpha- 
betic grouping of the search terms and a simple table, giving 
a very efficient route. 

Since the first character division method gave us about 
100 terms in the average-sized group, we next tried the initial 
two letters, or initial bigram. Theoretically, the terms would 
be divided in 2500 groups of 1.2 terms each, since there are 
50-squared possible bigrams. But, as unlikely as it is to 
find a word beginning with a semi-colon it is even less likely 
to find one beginning with two semi-colons. In practice, we 
found that half of all the search terms fell into groups headed 
by one of 60 bigrams, and that the actual average group size 
was 20, not the ideal 1.2. This was still a great improvement,
however, reducing matches to 20 x 5000 x 200 or 20 million.
This is 0.67% of the first case and better than the initial
letter method by a factor of five.

Implementing this algorithm requires a bit more coding. 
Skipping through the character string two characters at a time 
is not too difficult since the first letter of the second bi- 
gram was already found as the second letter of the first 
bigram, etc. The tables to the positions of the terms in 
the list that begin with each bigram are a bit more complex 
since they must be based on two values, one for each
character of the bigram.

Prior to beginning the search, a bigram table of 2500
sets of two numbers is set up. A unique value for the bigram
is determined by the formula $S = 50 P_1 - P_2$, where $P_1$ and $P_2$
are the positions of the first and second characters,
respectively, in the 50-character string. A character string of
length 50 is set up containing these 50 characters as ordered
by frequency of appearance of single characters in the data
base. The table is filled for each bigram; the first number
of each set being the starting position in the term list of
terms sorted on that bigram and the second number being the
number of terms sorted on that bigram. For example, CHRSTR
(1) refers to the letter E, and CHRSTR (9) to the letter H.
The value of the bigram EH is thus 50 x (1) - (9) or 41. If
words sorted on EH were the 189th through 197th terms in the
term list, TABLE (41,1) would be 189 and TABLE (41,2) would
be 9. Building the table is very rapid and is done only once
at the beginning of the program. Table positions for which
no terms exist in the term list are set to -1. As the search
proceeds through the citation, two letters at a time, a check
is made in TABLE for each value calculated. If a -1 is
found in TABLE, no check is required. If a positive value is found
for TABLE (N,1), matches are made starting with the value in
TABLE (N,1) and continuing for the number contained in TABLE
(N,2). In all, the overhead was not very much higher in
terms of either machine time or storage requirements, and so
we put this algorithm into production at a considerable
increase in efficiency.
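The bigram table can be sketched in Python as shown below. The ordering of the 50 characters in CHRSTR is hypothetical: it was chosen only so that E is character 1 and H is character 9, as in the example above, and the actual frequency ordering used at the Center is not given in this report. The function names are also illustrative, and term types and truncation are ignored.

```python
# Hypothetical 50-character string, most frequent character first.
CHRSTR = "ETAOINSRHLDCUMFPGWYBVKXJQZ" " 0123456789" ".,-()/;:'+*=&"
POS = {ch: i + 1 for i, ch in enumerate(CHRSTR)}   # 1-based positions


def bigram_value(bigram):
    """S = 50 * P1 - P2, a unique value in 0..2499 for each bigram."""
    return 50 * POS[bigram[0]] - POS[bigram[1]]


def build_table(terms):
    """Sort terms on initial bigram; TABLE holds (start, count) or -1."""
    ordered = sorted(terms, key=lambda t: bigram_value(t[:2]))
    table = [[-1, 0] for _ in range(2500)]
    for index, term in enumerate(ordered, start=1):   # 1-based term numbers
        s = bigram_value(term[:2])
        if table[s][0] == -1:
            table[s] = [index, 1]
        else:
            table[s][1] += 1
    return ordered, table


def bigram_search(text, ordered, table):
    found = set()
    for pos in range(len(text) - 1):
        pair = text[pos:pos + 2]
        if pair[0] not in POS or pair[1] not in POS:
            continue                                  # character outside the set
        start, count = table[bigram_value(pair)]
        if start == -1:
            continue                                  # no terms on this bigram
        for term in ordered[start - 1:start - 1 + count]:
            if text.startswith(term, pos):
                found.add(term)
    return found


ordered, table = build_table(["ALDEHYDE", "ALCOHOL", "KETONE"])
print(bigram_value("EH"))
print(bigram_search("PRESENCE OF ALDEHYDES IN WINE", ordered, table))
```

Because consecutive bigrams overlap by one character, a production version could carry the second position forward instead of looking it up again, as the text notes.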

5.6.5 Basic Citation-to-Term via Initial Trigram

With the results of initial unigram and bigram matching
at hand, the use of initial trigrams (three
letters) was considered. The same processes could be followed
but would be extended to three-character sets. In theory,
there would be 125,000 (50 cubed) possible groups into which
the terms could be divided, but since we only have 3000 total
search terms in the list, the best that could be achieved
would be 3000 groups with one term in each. In practice, this
would not quite be realized, since there are some very common
trigrams in English, such as "THR", "CRE", etc. The average
group size would be about 1.5. This would result in 1.5 times
5000 times 200 or 1.5 million matches, and would be much better
than the bigram method. It looked worthy of implementation,
but the overhead required to maintain the locator tables was
too high for trigrams. The size of the tables grows
exponentially. In the initial letter case it had 50 entries,
for bigrams it had 50-squared entries, and for trigrams it
had 50-cubed entries. The first two tables fit easily in
core storage (100 and 5000 bytes, respectively) but the third
needs 0.25 million bytes and would have to be compressed to
fit in core. The compression and coding necessary to
decompress each time the table was entered (for each character of
each citation) made the overhead much higher than the savings
realized by using trigrams, and so this algorithm was not
implemented. We estimate that the trigram method would begin
to show improved efficiency over the bigram method for term
lists containing 50,000 to 100,000 items.
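The growth of the locator table can be worked through directly. The short Python sketch below assumes 2 bytes per table entry, which is what the 100-byte and 5000-byte figures quoted above imply; the constant and function names are illustrative.

```python
ALPHABET = 50          # characters in the search character set
BYTES_PER_ENTRY = 2    # assumption implied by the 100 and 5000 byte figures


def table_size(n):
    """Entries and bytes for a locator table keyed on n-character prefixes."""
    entries = ALPHABET ** n
    return entries, entries * BYTES_PER_ENTRY


for n, name in ((1, "initial letter"), (2, "bigram"), (3, "trigram")):
    entries, size = table_size(n)
    print(f"{name:15s} {entries:7d} entries {size:7d} bytes")
```

Each added character multiplies the table by 50, while the 3000-term list can never fill more than 3000 of the 125,000 trigram entries.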

5.6.6 Basic Citation-to-Term via Least Common Bigram

Having determined that searching from citations to
search terms via initial bigram lookup was efficient, we
used this method in production for more than a year. However,
we had never given serious thought as to why we used the
initial bigram as the locator. Since dictionaries are commonly
ordered alphabetically on letters from left to right, we
took this as a natural way to order a word list. In point of
fact, selection of any bigram but the first one would have
divided the search term list into more groups, with fewer terms
per group, on the average. Having noticed this, it was simple
to determine which bigram should be chosen for any given
search term. That bigram is the Least Common Bigram (LCB),
i.e., that bigram that appears least frequently in the data
base. Since we had prepared KLIC Indexes, we knew the
frequency distribution for all the bigrams in each data base.
With the LCB screen technique, as search terms enter the
system, a small routine checks them against the bigram
frequency table for the appropriate data base. For example, the
word "MOLYBDENUM" contains the bigrams MO, OL, LY, YB, BD, DE,
EN, NU, and UM. The routine checks each of these in the
table and finds the LCB, the bigram with the lowest frequency,
in this case, BD. MOLYBDENUM is then maintained in the search
term list under the bigram (LCB) BD rather than the initial
bigram MO. The TABLE for finding the proper position in the
search term list is maintained and used exactly as it was
for the basic bigram technique.
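The selection and use of the LCB can be sketched as follows. The bigram frequencies below are invented for illustration and are not taken from any data base; a dictionary stands in for TABLE; each term is stored with the offset of its LCB from the start of the term, as the PCSCTERM description in Section 5.5.20 indicates. Truncation modes and term types are omitted, and all function names are hypothetical.

```python
# Invented bigram frequencies; a real table would come from the data base.
FREQ = {
    "MO": 900, "OL": 1200, "LY": 700, "YB": 40, "BD": 15,
    "DE": 2500, "EN": 3000, "NU": 400, "UM": 800,
    "AL": 2800, "LD": 500, "EH": 30, "HY": 300, "YD": 60,
}


def least_common_bigram(term, freq):
    """Return (bigram, offset) for the least frequent bigram in the term."""
    pairs = [(term[i:i + 2], i) for i in range(len(term) - 1)]
    # Bigrams missing from the table are treated here as frequency 0.
    return min(pairs, key=lambda p: freq.get(p[0], 0))


def build_lcb_index(terms, freq):
    index = {}
    for term in terms:
        bigram, offset = least_common_bigram(term, freq)
        index.setdefault(bigram, []).append((term, offset))
    return index


def lcb_search(text, index):
    found = set()
    for pos in range(len(text) - 1):
        for term, offset in index.get(text[pos:pos + 2], ()):
            start = pos - offset          # slide the term back under the text
            if start >= 0 and text[start:start + len(term)] == term:
                found.add(term)
    return found


index = build_lcb_index(["MOLYBDENUM", "ALDEHYDE"], FREQ)
print(least_common_bigram("MOLYBDENUM", FREQ))
print(lcb_search("PRESENCE OF ALDEHYDES IN MOLYBDENUM ORE", index))
```

With these frequencies MOLYBDENUM is filed under BD at offset 4 and ALDEHYDE under EH at offset 3, so a term is examined only when its rarest bigram appears in the citation.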

This technique of using LCB's provides two improve- 
ments in efficiency. First, more bigrams are used within 
words than are used to begin words. For example, no words 
begin with "KK", yet "BOOKKEEPING" and other words contain it. 
Thus words are divided into more groups, with fewer words 
per group. In practice, groups average 5 terms in size,
making the number of matches 5 times 5000 times 200 or 5
million. This is only one-fourth as many as for initial
bigrams. The second beneficial feature of the use of LCB's
is that the largest groups are sorted under LCB's that occur 
least frequently, so the largest groups are searched less 
often than the smaller ones. These two factors combine to 
make an LCB-based algorithm highly efficient. 
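The match estimates quoted for the five methods all follow from one product: average group size times citations times citation length. The Python sketch below recomputes them from the run parameters assumed in Section 5.6.1; the dictionary keys are just labels.

```python
CITATIONS = 5000          # citations per issue (assumed in 5.6.1)
CITATION_LENGTH = 200     # characters per citation (assumed in 5.6.1)

# Average number of terms examined per citation character.
AVERAGE_GROUP = {
    "term-to-citation": 3000,
    "initial letter": 100,
    "initial bigram": 20,
    "initial trigram (estimate)": 1.5,
    "least common bigram": 5,
}


def matches(group_size):
    return group_size * CITATIONS * CITATION_LENGTH


for method, group in AVERAGE_GROUP.items():
    print(f"{method:28s} {matches(group):>15,.0f}")
```

The LCB figure of 5 million is one-fourth of the initial bigram figure and one six-hundredth of the term-to-citation figure.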

The additional machine time to arrange terms by LCB
is minimal, and little extra coding is required to
search on this basis. It is necessary to maintain a number
for each word that indicates how many characters from the
beginning of the word the LCB is located, so that proper
screening checks can be made. An example of search using the
LCB technique follows.

Once the LCB screen has indicated the portion of the 
search term list to be searched, each term in the relevant 
profile term sublist is compared with the section of text 
indicated by the screen LCB as a possible match for that 
term. The relevant portion of text is delimited by referring 
to a number pair associated with each term. The first number 
tells where the compare area begins in relation to the screen 
LCB under consideration, and the second number gives the 
length of the character string to be compared. For example, 
suppose a title that includes the phrase ...PRESENCE OF 
ALDEHYDES IN . . . is being searched against the term list. 
Taking each bigram in turn as an entry to the list the search 
algorithm will come in due course to the bigram EH and access 
from TABLE the information that entries referenced by the LCB 
EH start at term 526, and that there are nine of them: 

                                     Required     Characters
               Term                  Truncation   Preceding    Term
Term    LCB    Type   Term           Mode         LCB          Length

526     EH     02     ACETALDEHYDE     1000          7           12
527     EH     02     ALDEHYDE         0010          3            8
528     EH     02     ALDEHYDE OIL     0010          3           12
529     EH     02     BUTYRALDEHYDE    1000          8           12
530     EH     02     DEHYDROGENASE    0011          1           13
531     EH     02     DEHYDROGENAT     0010          1           12
532     EH     02     FORMALDEHYDE     1000          7           12
533     EH     02     PROPIONALDEHYD   1000         10           14
534     EH     02     VALERALDEHYDE    1000          8           13

Present in the core image of the term list are the additional 
numbers written in the sample above. Starting with term 526, 
and using the last two numbers, 7 and 12, the search routine 
delimits the compare area in the title as follows: 

    ...PRESENCE OF ALDEHYDES IN...
               |-----|^^             7 characters preceding; ^^ = screen bigram
               |----------|          total term length, 12 characters

and performs a compare on the character strings "ACETALDEHYDE" 
and " OF ALDEHYDE" (blanks included). The result is not an 
equality, so the search goes on to term 527 and delimits the 
title again: 

    ...PRESENCE OF ALDEHYDES IN...
                   |-|^^             3 characters preceding; ^^ = screen bigram
                   |------|          total term length, 8 characters



This time it compares the term character string ALDEHYDE 
and the text character string ALDEHYDE. This compare indicates 
the existence of a match, so the program goes on to determine 
the truncation modes that are satisfied. Testing the positions 
on either side of the compare area, the program finds a non- 
alphanumeric character (a blank) preceding the term and an 
alphanumeric character S following the term. Thus this 
citation satisfies the requirements for "right" and "both" 
truncation modes, and its found truncation mode is 0011. 
Combining this with the required truncation mode (0010) by 
a logical AND operation gives a nonzero result (0010). Thus 
term 527 is a hit term. The additional terms referenced by 
the LCB EH are then tested by the process outlined above, 
with no more hits resulting. 

The immediately obvious advantage of the LCB search 
method is that only a subset of the term list is checked in 
each match. Further, since there are only 50 characters in 
the set, the number of bigram entries in TABLE (each a pair 
of numbers) remains the same however large the term list 
grows. It then follows 
that a two-fold increase in the size of the term list will 
not result in a two-fold increase in search time. Thus, the 
rate of increase of search time decreases as the number of 
terms increases. This technique shows a time 
savings for more than 120 terms. 
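
The ALDEHYDE walk-through above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CSC code. In particular, the meaning of the individual truncation bits (1000 = no truncation, 0100 = left, 0010 = right, 0001 = both) and the rule for which modes a boundary satisfies are assumptions chosen to agree with the one worked case in the text (found mode 0011); all names are hypothetical.

```python
# Truncation-mode bits (assumed assignment, see lead-in).
NONE, LEFT, RIGHT, BOTH = 0b1000, 0b0100, 0b0010, 0b0001


def found_truncation_mode(text, start, length):
    """Return the bit mask of truncation modes satisfied at this position."""
    end = start + length
    # A position outside the text counts as a non-alphanumeric boundary.
    clean_left = start == 0 or not text[start - 1].isalnum()
    clean_right = end >= len(text) or not text[end].isalnum()
    mode = BOTH                      # characters allowed on either side
    if clean_left:
        mode |= RIGHT                # nothing alphanumeric precedes the term
    if clean_right:
        mode |= LEFT                 # nothing alphanumeric follows the term
    if clean_left and clean_right:
        mode |= NONE                 # the term stands alone
    return mode


def is_hit(text, screen_pos, entry):
    """Compare one term-list entry against the text around a screen bigram."""
    term, required_mode, preceding, length = entry
    start = screen_pos - preceding   # first number of the pair
    if start < 0 or text[start:start + length] != term:
        return False
    # Logical AND of found and required modes; nonzero means a hit.
    return (found_truncation_mode(text, start, length) & required_mode) != 0


TITLE = "PRESENCE OF ALDEHYDES IN"
# term number -> (term, required mode, characters preceding LCB, length)
EH_ENTRIES = {
    526: ("ACETALDEHYDE", 0b1000, 7, 12),
    527: ("ALDEHYDE", 0b0010, 3, 8),
    528: ("ALDEHYDE OIL", 0b0010, 3, 12),
}

screen_pos = TITLE.index("EH")
hits = [n for n, entry in EH_ENTRIES.items() if is_hit(TITLE, screen_pos, entry)]
```

Only the first three entries of the EH sublist are included to keep the sketch short; with them, term 527 is the only hit, as in the text.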

5.6.7 Summary 



The Least Common Bigram algorithm is based on a number 
of discrete steps, each of which gave more insight into 
search algorithms and increased efficiency. It is a very 
good algorithm and is based on the characteristics of the 
data base being searched. The table below summarizes the 
evolution of our search methodology. 

METHOD                   MATCHES/ISSUE* 

Terms vs Text            3,000,000,000 
Initial Letter             100,000,000 
Initial Bigram              20,000,000 
Initial Trigram              1,500,000** 
Least Common Bigram        < 5,000,000 

 *Based on 3,000 search terms and 5,000 200-character 
  citations. 

**Increase in overhead for processing more than negated 
  savings. 

5.7 Logic Evaluation 

The CSC system allows the use of the Boolean operators 
"AND," "OR," and "NOT" nested to any degree to indicate the 
relationship of search terms in a profile. Terms related 
in this way form a logic expression, which 
consists of term numbers (or link characters for those 
terms grouped in a link), the operators, and parentheses to 
indicate the order in which the operators are to operate 
upon the term representations (operands). The profile writer 
is free to use as many parentheses as necessary to express 
the concepts embedded in the term relationships. 

When one or more of the search terms in a given profile 
is found in a citation, the logic expression for that profile 
must be evaluated to determine if a "true" hit has been found. 
To facilitate machine evaluation of logic expressions, they 
are converted, at profile input time, from the parenthe- 
tical notation used by the profile writer, to an unambiguous 
parenthesis-free notation. One such form of notation is 
called Polish notation, after the nationality of its inventor. 

5.7.1 Early Operator Reverse Polish 

CSC uses the Early Operator Reverse version of this 
notation, commonly called by its acronym, EORP. There are 
also "Late" versions and "Forward" versions, giving a total 
of four combinations, EORP, EOFP, LORP, and LOFP. The 
"Late" and "Early" refer to the relative positions of operators 
and operands, while the "Reverse" and "Forward" refer to the 
direction of evaluation of the expression. 

EORP notation is based on assignment of preference to 
the operators. Thus a program can be written to convert 
parenthetical notation to this form, by replacing parentheses 
that denote operational order with one based on operator pre- 
ference. Consider the two simple expressions: 

(A & B) | C 
(A | B) & C 

(where & = AND, | = OR, ¬ = NOT), which are clearly not identical. 
In EORP, these would be respectively: 

AB & C | 
AB | C & 
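
A conversion of this kind can be sketched with a conventional operator-precedence (shunting-yard style) pass. The sketch below is an illustration, not the CSC conversion program: the preference order (NOT above AND above OR) is an assumption, operands are limited to single characters, and "~" stands in for the NOT sign so that the code stays in plain ASCII.

```python
# Assumed operator preference: NOT binds tightest, then AND, then OR.
PRECEDENCE = {"~": 3, "&": 2, "|": 1}


def to_eorp(expression):
    """Convert a parenthesized logic expression to postfix (EORP-style) form."""
    output, stack = [], []
    for ch in expression.replace(" ", ""):
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            # Emit everything back to the matching left parenthesis.
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
        elif ch == "~":
            stack.append(ch)         # unary prefix: wait for its operand
        elif ch in PRECEDENCE:
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]):
                output.append(stack.pop())
            stack.append(ch)
        else:
            output.append(ch)        # an operand (single character)
    while stack:
        output.append(stack.pop())
    return " ".join(output)


first = to_eorp("(A & B) | C")
second = to_eorp("(A | B) & C")
```

The output separates every element with a blank; apart from that spacing it reproduces the two EORP strings given above.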

EORP expressions are evaluated by proceeding from left to 
right, performing the operation called for by each operator 
upon the preceding two elements (except for NOT which is a 
unary operator). The result of each such operation is an 
operand in the next stage. To indicate the sequence of an 
evaluation, consider the following expression as an example: 

(((A | B) & (C | D)) & E) & ¬F 

which becomes, in EORP: 

AB | CD | & E & F¬ & 

To show the evaluation, we assume that A, B, D, and E are 
present (we will use "1" for present or True and "0" for 
not present or False). The expression is evaluated as shown 
in the steps below: 

11 | 01 | & 1 & 0¬ &      Expression with '1' and '0' 

1 01 | & 1 & 0¬ &         Evaluation of 1st operator 

1 1 & 1 & 0¬ &            Evaluation of 2nd operator 

1 1 & 0¬ &                Evaluation of 3rd operator 

1 0¬ &                    Evaluation of 4th operator 

1 1 &                     Evaluation of 5th operator 

1                         Evaluation of 6th operator 

The result is True. In each line the next operator is 
evaluated, the result "dropped down," and the operator is 
removed. The process is continued until all operators have 
been checked and a True or False answer results. 
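
The left-to-right evaluation can be sketched with a simple stack. This is a minimal Python illustration, not the CSC routine; elements are separated by blanks, "~" stands in for the NOT sign, and the function name is hypothetical.

```python
def evaluate_eorp(expression, present):
    """Evaluate a blank-separated EORP expression.

    `present` is the set of terms found in the citation; every other
    operand is treated as False.
    """
    stack = []
    for token in expression.split():
        if token == "~":                      # unary NOT
            stack.append(not stack.pop())
        elif token == "&":
            right, left = stack.pop(), stack.pop()
            stack.append(left and right)
        elif token == "|":
            right, left = stack.pop(), stack.pop()
            stack.append(left or right)
        else:                                 # an operand
            stack.append(token in present)
    return stack.pop()


EXPRESSION = "A B | C D | & E & F ~ &"
result = evaluate_eorp(EXPRESSION, {"A", "B", "D", "E"})
```

Each operator replaces its operand(s) on the stack with the result, which corresponds to the "dropped down" value in the steps shown above.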

The major failing with EORP notation evaluation is that 
the entire expression must be checked before the final result 
is known. Consider the expression: 

A & (B | C | D | E | F | G | H | I | J) 

which in EORP would be 

ABC | D | E | F | G | H | I | J | & 

Only when the last & is checked is the result found. Yet it 
is immediately obvious, in the parenthetical notation, that 
if A is not present, the expression is definitely False. 

To get around this drawback of EORP notation, we have been 
considering and testing two alternative logic evaluation 
systems. 

5.7.2 Tree Logic 

An alternative to the EORP logic system would be generating 
a tree to represent the logic expression. Each operator 
would be a node. The expression above would generate the 
tree: 




If A is not True, evaluation ceases immediately. If all 
subsequent False branches are followed, the result is False. 
If any subsequent True branch is followed, the result is 
True. It is necessary to follow the whole tree down, 
rather than exiting as True as soon as A and any one of the 
others is found, so that all of the other True terms can be 
detected. On the average, 
this type of evaluation should allow finishing evaluation 
in half the time required for EORP notation evaluation. 
However, constructing the trees is difficult, especially for 
recursive expressions and those that use the same term more 
than once. We are still testing this technique. 
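
The behavior described for the tree can be sketched as follows. This is an illustration only, since the technique was still under test: the nested-tuple representation and all names are assumptions. AND stops at the first False branch, while OR visits every branch so that all True terms are recorded.

```python
def evaluate_tree(node, present, hits):
    """Evaluate a logic tree; append each term found under AND/OR to `hits`.

    A node is either a term (a string) or a tuple (operator, child, ...).
    """
    if isinstance(node, str):
        found = node in present
        if found:
            hits.append(node)
        return found
    operator, *children = node
    if operator == "~":
        # Terms under a NOT are not reported as hits.
        return not evaluate_tree(children[0], present, [])
    if operator == "&":
        for child in children:
            if not evaluate_tree(child, present, hits):
                return False             # cease evaluation immediately
        return True
    if operator == "|":
        result = False
        for child in children:           # follow every branch
            if evaluate_tree(child, present, hits):
                result = True
        return result
    raise ValueError(f"unknown operator: {operator}")


TREE = ("&", "A", ("|", "B", "C", "D", "E", "F", "G", "H", "I", "J"))

hits = []
result = evaluate_tree(TREE, {"A", "C", "J"}, hits)

no_a_hits = []
no_a_result = evaluate_tree(TREE, {"C", "J"}, no_a_hits)
```

When A is absent the OR branch is never visited, which is the saving the text attributes to this method.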

5.7.3 Modified EORP 

A second alternate logic system would involve retaining 
the EORP notation, but for terms grouped together in a link 
(all terms in a link are implicitly OR'd together), a second 
expression would be generated. An initial evaluation would 
be made of the short expression (the one with only one operand 
per link). Only if the expression were found to be true 
would the entire expression be checked (the one with an 
operand for each term in each link). This system appears 
simpler to implement and we are now running timing tests on 
it. 
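
The two-stage idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the system being timed: the link names, term numbers, and function names are hypothetical, and expressions are written with blank-separated elements.

```python
def evaluate_eorp(expression, truth):
    """Evaluate a blank-separated EORP expression against a truth table."""
    stack = []
    for token in expression.split():
        if token == "~":
            stack.append(not stack.pop())
        elif token in ("&", "|"):
            right, left = stack.pop(), stack.pop()
            stack.append((left and right) if token == "&" else (left or right))
        else:
            stack.append(truth.get(token, False))
    return stack.pop()


def two_stage_evaluate(short_expr, full_expr, links, present):
    """Return (result, full_expression_was_evaluated)."""
    # Stage 1: one operand per link; a link is True if any of its terms hit.
    link_truth = {name: any(term in present for term in terms)
                  for name, terms in links.items()}
    if not evaluate_eorp(short_expr, link_truth):
        return False, False
    # Stage 2: only now check the expression with an operand for each term.
    term_truth = {term: True for term in present}
    return evaluate_eorp(full_expr, term_truth), True


LINKS = {"X": ["T1", "T2", "T3"], "Y": ["T4", "T5"]}
SHORT = "X Y &"
FULL = "T1 T2 | T3 | T4 T5 | &"

rejected = two_stage_evaluate(SHORT, FULL, LINKS, {"T2"})
accepted = two_stage_evaluate(SHORT, FULL, LINKS, {"T2", "T5"})
```

In the rejected case only the three-element short expression is evaluated; the longer expression is never touched.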
5.8 Private Libraries System 

5.8.1 Private Libraries System--General 

The Private Libraries System is a software system that 
is used for the creation, maintenance, and searching of 
private files or subset data bases in machine-readable form. 

A private library can be established for an individual, a 
laboratory, a company, or any other organizational unit. 

Input to a private library can be from any source specified 
by the requestor. It may be citations, abstracts, full 
text, or document surrogates containing virtually any kind 
of data element the user wishes to retain. The documents 
may be company reports, literature, references, laboratory 
log books, correspondence files, etc. The data elements 
may be authors, titles, project numbers, key words, index 
terms specified by file users, codes corresponding to any 
meaningful data parameter the user may wish to record, etc. 

The user who has a private file established and main- 
tained for him controls the input to the file. He determines 
what should go into the file, what should be deleted from the 
file, and when the file should be purged. All of the items 
in the file represent his judgments and decisions as to 
relevance. He may wish to have his weekly output from the 
SDI system automatically entered into his private library 
for use at a later date or he may want to look at the output 
to determine which citations should be included and which 
should not. Additionally, he may want to enrich the cita- 
tions by adding indexing terms, codes, or categories that 
have meaning to him or his company such as project numbers, 
product numbers, etc. The net result is a personally 
tailored file in machine-readable form that the user may 
search on demand. 

The Private Libraries System can be adapted to accommodate 
virtually any existing file of document-related data. 
Hence, a company that wants to establish a computerized re- 
trieval system for its files need not go to the expense of 
time-consuming and costly software design and development; 
it can have its files converted to IITRI format and then 
use an existing software system. 

5.8.2 Software System 

The Private Libraries System (PLS) is a group of pro- 
grams that interface with the CSC search system to pro- 
vide search facilities for use with SDI output and to 
provide search facilities for user data bases. Use of the 
cross-over capability, however, is completely optional-- 
SDI users need not direct their output to PLS files and PLS 
users need not ever search their libraries. The system 
consists of several components, which are listed below. 

• PLSXT--a program which collects output from 
SDI profiles, reformats it, and moves it 
into the Interface Library 

• PRILIB--a collection of program modules to 
create, expand, maintain, condense, and 
list libraries of citations 

• Conversion Programs--a set of programs used 
to build libraries from existing machine- 
readable data bases 



• PLSST--a program to reformat a PLS file for 
searching 

• Interface Library--a PLS-format library of 
citations collected from the search system 
(both SDI and PLS searches) but not yet 
merged into User Libraries 

• User Libraries--a set of PLS-format libraries 
associated with individual users, where one 
library may hold citations for many users or 
one user may own several libraries 

A library is an OS data set containing citations in PLS 
format. This format is derived from IITRI standard format 
by adding the user ID number to both directory and string 
portions of the record along with additional internal items: 

            Position     Contents 

Record 1      1 - 10     User ID number 
             11 - 21     Citation number 
             22          '1' 
             23 - 262    The directory (60 fullword 
                         binary numbers) 
            263 - 264    Purge date 

Record 2      1 - 10     User ID number 
             11 - 21     Citation number 
             22          '2' 
             23 - 4000   String portion of citation 

Both records are OS variable-length records, and only 
the meaningful portion of the second record is actually 
present. The records in a library are in order according 
to the first 22 characters of the records. 
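
The 22-character ordering field can be sketched as follows. This is an illustration, not PLS code: it assumes the user ID and citation number are held as character strings padded on the left with zeros, which the report does not state, and the function name is hypothetical.

```python
def pls_sort_key(user_id, citation_no, record_no):
    """Build the 22-character key: user ID (10), citation number (11),
    record number (1)."""
    key = f"{user_id:0>10}{citation_no:0>11}{record_no}"
    if len(key) != 22:
        raise ValueError("field too long for the PLS key layout")
    return key


keys = [
    pls_sort_key("A01", "57", 2),
    pls_sort_key("A01", "57", 1),
    pls_sort_key("A01", "9", 1),
]
keys.sort()
```

Because the record number occupies the last position of the key, the directory record ('1') of a citation always sorts immediately ahead of its string record ('2').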

5.8.2.1 PLSXT 

The PLS Extractor Program (PLSXT) is the search-system- 
to-PLS interface. It reads the file of profile headers 
created by INPUTR and builds a list of those containing the 
code "PRI" in the Security field. It then scans the hit 
and citations files and extracts citations found by the 
selected profiles. The citations are converted to PLS format, 
sorted in ascending order on their first 22 characters, and 
merged (by the same ordering field) into the interface library. 
This program is reasonably fast, using about 20 seconds 
of CPU time to extract and reformat 200 citations and merge 
them into a 10,000-citation library. 

Since the regular CSC search system is used for searching 
PLS libraries, this program is responsible for collecting 
the results of searches of PLS libraries and making them 
available to PLS for examination and/or storage. 



PLSXT in no way interferes with the SDI search system. 
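
The sort-and-merge step performed by PLSXT can be sketched as follows. This is a Python illustration, not the PLSXT program: records are modeled as character strings whose first 22 characters are the ordering field, and the sample records and names are hypothetical.

```python
import heapq

KEY_LENGTH = 22  # records are ordered on their first 22 characters


def sort_key(record):
    return record[:KEY_LENGTH]


def merge_into_library(library, extracted):
    """Sort the newly extracted citations, then merge them into an
    already-sorted interface library on the same ordering field."""
    batch = sorted(extracted, key=sort_key)
    return list(heapq.merge(library, batch, key=sort_key))


def record(user_id, citation_no, record_no, body):
    """Hypothetical helper that lays out one record for the example."""
    return f"{user_id:0>10}{citation_no:0>11}{record_no}{body}"


library = [record("A01", "10", 1, "old"), record("A01", "30", 1, "old")]
extracted = [record("A01", "40", 1, "new"), record("A01", "20", 1, "new")]
merged = merge_into_library(library, extracted)
```

Merging two sorted sequences takes a single pass over the library, which is consistent with the modest running time reported above.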

5.8.2.2 PLSST 

The PLS Search Transformation Program (PLSST) is the 
PLS-to-search-system interface. It reads a PLS-format 
library, sorts it on characters 11-22 (ignoring the user 
ID number), and then removes the extraneous information added 
for PLS use. The resulting file can be searched by the 
standard CSC search system. The results of the search can 
be re-entered into PLS by coding PRI in the profile(s) 
used for the search. This program runs rather faster than 
PLSXT, since only the format conversion and sorting are done. 

5.8.2.3 Special Conversion Programs 

While PRILIB (below) supports addition of citations 
to libraries, large collections of data can be added more 
efficiently by independent conversion programs. One of 
these programs is for use with a standard card format, and 
is used for entering data bases which are not in machine- 
readable form. Other users can be accommodated by special 
programs written to suit their specific data base, using a 
combiner/writer module common to all conversion programs. 
Use of this latter routine ensures consistent output. 

Running times for these programs vary with the 
complexity of the format being read. The range of speeds 
is similar to that for the CSC FORCON's, though they tend 
to be somewhat faster because sorting the data base in- 
volved is usually simpler than sorting a commercial data 
base. (However, this may not always be true if the data 
base requires unusual and difficult conversion.) 

5.8.2.4 PRILIB 

The PLS core system contains a group of program 
modules. The command interface module reads and passes 
user commands and calls the various modules necessary to 
perform the desired action. The modules include: 

• Command Interface 

• Input--performs LOAD and MERGE operations 

• Maintenance--performs add, delete, and alter 
operations 

• LISTMON--generates output for one citation 

• LISTOUT--combines the results of all LISTMON 
calls into a single listing 

• ACCT--keeps track of user statistics 

In use, a library file is read into a temporary disk file 
that becomes the current file. All manipulation is done 
with the current file. Other libraries can be merged into 
it and it can be written, in whole or in part, to create 
new libraries or replace old ones. Maintenance operations 
can be performed on citations or on groups of citations 
and various listings can be generated from all or part of 
the current file. 

5.8.3 User Interaction with Files--Commands 

User commands are file commands (for input, output, and 
listing generation) and maintenance commands (for adding, 
altering, or deleting citations or fields within citations). 
File commands are free-format, consisting of an operation 
keyword, a file name, an ID mask, and a listing command. 
All of these except the keyword are optional. The ID mask 
specifies portions of the ID number which must be matched for 
a record to be read or written. The listing command speci- 
fies how the records selected by the ID mask are to be listed, 
if at all. Listing types are bibliographic, tabular, and 
keyword-in- and out-of-context (KWIC and KWOC); various sort 
options are available. Commands available are LOAD and MERGE 
for input, and PURGE, DUMP, and EXTRACT for output. The out- 
put commands differ in their effect on the current file: PURGE 
retains all except selected records, EXTRACT retains only 
selected records, and DUMP retains the previous current file 
intact. All three write selected records to the specified 
file. 
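
The effect of the three output commands on the current file can be sketched as follows. This is a Python illustration, not PRILIB code: records are modeled as dictionaries, the asterisk is taken as the "don't care" character in the ID mask, and a mask shorter than the ID number is assumed to be compared from the left. All names are hypothetical.

```python
def mask_matches(mask, id_number):
    """True if every non-asterisk mask position equals the ID position."""
    return all(m == "*" or m == c for m, c in zip(mask, id_number))


def apply_output_command(keyword, current, mask):
    """Return (records written to the file, new current file)."""
    selected = [r for r in current if mask_matches(mask, r["id"])]
    remaining = [r for r in current if not mask_matches(mask, r["id"])]
    if keyword == "PURGE":        # retain all except the selected records
        return selected, remaining
    if keyword == "EXTRACT":      # retain only the selected records
        return selected, selected
    if keyword == "DUMP":         # current file is left intact
        return selected, list(current)
    raise ValueError(f"not an output command: {keyword}")


CURRENT = [{"id": "12A01"}, {"id": "12A02"}, {"id": "34B01"}]

purged = apply_output_command("PURGE", CURRENT, "**A0*")
extracted = apply_output_command("EXTRACT", CURRENT, "**A0*")
dumped = apply_output_command("DUMP", CURRENT, "**A0*")
```

In all three cases the same selected records are written out; only the contents of the current file afterwards differ.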

Maintenance commands are fixed-format commands consisting 
of a keyword, ID and citation number mask, a field specifier, 
and a new data value (except for deletes). If complete ID 
and citation numbers are given, the program scans forward to 
the specified citation, performs the specified operations, 
and leaves the current file where it is, so that if the new 
commands call for a later citation, previous ones need not 
be re-scanned. If a command contains "don't care" positions, 
specified by asterisks, the current file is reset to the be- 
ginning and all records matching the required portion are 
modified or deleted as specified (obviously, "don't care" 
positions are not permitted in commands to add citations, 
though they are permitted in commands to add fields). The 
command formulas are: 



File Commands 

<command>       ::=  % <keyword> <specifier> 
<keyword>       ::=  LOAD | MERGE | DUMP | EXTRACT | 
                     PURGE | ABSET | MAINT 
<specifier>     ::=  ... 
<file spec>     ::=  ... 
<mask spec>     ::=  ... 
<list spec>     ::=  ... 
<list type>     ::=  TAB | BIBLIO | KWIC | KWOC | USER 
<sort options>  ::=  ... 
<field type>    ::=  AUTHOR | TITLE | CITNO | IDNO | 
                     CODEN | SUBJECT | <type no> 
<ID mask>       ::=  string of ten or fewer positions 
<file name>     ::=  legal OS/360 ddname 
<type no>       ::=  number in the range 0-999 

Examples 

% LOAD INFILE,**A01/TAB(AUTHOR) 
% PURGE OUTFILE,**A0* 
% MERGE UPDFILE/BIBLIO 

Maintenance Commands 

Positions     Contents 

 1 - 10       ID number 
11 - 21       Citation number 
22 - 25       Keyword 
26 - 28       Field type 
29 - 30       Iteration of field 
31 - 78       Data 
79 - 80       Sequence number (for continuations) 

Both file and maintenance commands may be continued. For 
file commands continuation cards begin %$, and position 3 
of card n+1 is treated as following position 80 of card n. 
For maintenance commands positions 1-29 of continuation 
cards match the first card, and positions 79-80 contain 
ascending numbers. 

Two special commands are written in file command 
format, but specify conditional, rather than immediate 
output. ABSET is a DUMP command to be performed if PRILIB 
abends or if a command error causes the job to stop before 
all commands are fulfilled. MAINT is a DUMP command per- 
formed as maintenance operations occur, allowing saving of 
pre-modification values and generation of listings during 
maintenance operations. 

5.8.4 Libraries 

PLS libraries are OS sequential data sets. Currently 
they have record length 4004, blocksize 4008, and variable- 
blocked format, but shorter lengths could be used. Typical 
citations obtained from SDI runs on CA Condensates total 
about 600 bytes, from EI Compendex around 1000 bytes. Fre- 
quently-used files might be kept on disk, but most will be 
tape-resident. CA Condensates citations would fit about 
12 per track (on a 2314) if a different blocksize were 
used to permit more blocks per track. The interface library, 
in particular, might be disk-resident. In normal use PLSXT 
merges citations into this file after each search run; 
the citations are deleted from the file as users extract 
them for their personal libraries. 

While most libraries consist of bibliographic citations 
(hence the term libraries) there is no restriction on the 
contents of citations. Any character-string data can be 
stored, including numerical values. Listings of citations 
including numerical data can be generated as with other 
data, though no arithmetic or totalling operations are 
supported. Users are permitted to add their own data types 
as desired, subject to CSC conventions, and define special 
listing formats to be generated by using the USER listing 
type. 

5.8.5 Use of the System 

The Private Libraries System affords users a unique 
ability to store their SDI output in machine-readable 
form and access it in various ways, to mix SDI output 
with other data, to add data to SDI citations, and 
to search collections of SDI and other citations with 
CSC profiles. As well as permitting grouped listings, 
as opposed to individual cards, this permits additional 
use of the data. In many cases the utility of the data 
can be increased by adding further information to citations. 
If, for instance, a file is used as a library catalog, 
such data as accession numbers, shelf locations, periodical 
renewal dates, and reader comments might be added to 
citations. An example of an added field is the secondary 
citation field which PLSXT adds to citations it creates 
to refer to the journal issue in which the citation was 
found. 

PLS strikes a careful balance between flexibility and 
simplicity. While the file and listing commands support 
the basic data base operations, stand-alone conversion 
programs, the USER listing type, and user-added data fields 
permit sufficient flexibility to meet a wide variety of 
data base needs. A serendipitous side-effect of the 
modular design and simple data structure is that the 
program can easily be modified to suit special data base 
needs. Many common features that would add needless com- 
plexity can be built in for special applications. This 
combination of flexibility and simplicity provides max- 
imum ability at minimum cost. 

6. DATA BASES--CHARACTERISTICS, STATISTICS, AND COMPARISONS 

In addition to the obvious intended variation in content, 
data bases vary in external characteristics both within 
and between supplier organizations. The variation exists in 
terms of machine code, character code, tape density, labeling 
conventions, blocking factors, content of logical and physical 
records, data elements included, data element content, codes 
employed, and format of the tape. CSC analyzed a number of data 
bases for these items and presented findings in a paper en- 
titled "Comparison of Document Data Bases" which appeared in 
the Journal of the American Society for Information Science, 
Volume 22, No. 5, September-October 1971 (reference 4). Such 
inconsistencies and non-standard representations are 
accommodated in the 
CSC system by use of format conversion preprocessor programs 
as discussed in Section 5.2. 

We have done further analysis of the CSC production data 
bases CA, BA and EI. We have developed statistics and analyzed 
them in order to gain insights into the use of the data bases, 
prepare projections for future storage and searching require- 
ments, etc. 

6.1 Data Base Characteristics 

6.1.1 Number of Citations per Issue 

The number of citations per issue varies from data base 
to data base and often within a data base. BA Previews, for 
example, produces a fixed number of citations per issue through- 
out the volume -- 583.) citations appear on each issue of BA 
and 7500 on each issue of BIORI. 

In the case of CA, over the past three years issues of 
CA Condensates have contained from 3400+ to 8800+ citations 
(see Figures 
6-1 through 6-5) with the average in 1969 being 4600+ and the 
average in 1972 being 6000+. EI issues range from 4400+ to 
8300+ citations per month (see Figure 6-6) with a small 
percentage due to erroneous citations being recycled. (No 
figure is presented for BA because the number is constant.) 

The number of citations directly affects the cost of 
searching and hence the price we must require for subscriptions. 
Thus, a 30% increase that has occurred in CA from 1969 to 1972 
should imply an increase in subscription fee. 

The number of citations affects cost and the cost per 
citation per issue is relatively constant for a fixed number 
of profiles. As the number of profiles increases cost/citation/ 
issue will increase because individual citations are evaluated 
for more profiles. 

6.1.2 Statistics on Length of Citations, Data Fields per 
Citation and Key Words per Citation 

Statistics are given for CA in Table 6-7 showing: the 
number (or average number) of citations on a tape(s) together 
with the mean, standard deviation, and maximum length of the 
citations; average number of data fields per citation; average 
number of key words per citation -- mean, standard deviation, 
and maximum. Note that the mean length of citations (number 
of characters) and the number of data fields/citation 
increased after CAS added two new data fields -- cross references 
and patent priority codes. Also, the number of key words/ 
citation increased in the later issues. This is due in part 
to our inclusion of cross references in the key word portion 
of the IITRI-formatted tape (because cross references provide 
subject type information) but it also represents an increase 
in the number of key words assigned by CAS. Such data base 
additions affect the center both positively and negatively. 
They increase the cost of processing a tape but also increase 
retrieval capability by providing more locators. 

6.1.3 Percent Occurrence of Data Types 

There are many different data types or data elements 
present on various data bases. Even within a given data 
base that specifies use of certain data types the frequency 
with which data types appear may vary. In CA, for example, 
titles, CODEN and keywords are present in 100% of the citations; 
authors 99%; journal titles 91-98%, etc. (See Table 6-2.) 

[Figure 6-1] 

[Figure 6-2] 

[Figure 6-4. Number of Citations on Tape vs. Issue: 
Chemical Abstracts Condensates Volume 75] 

[Figure 6-5] 

[Figure: Number of Citations on Tape vs. Issue: 
Engineering Index COMPENDEX Volumes 71, 72] 

[Table: Statistics on Length, Data Fields per Citation, 
Key Words per Citation in CA Condensates] 

6.1.4 Average Length of Data Entries by Data Types 

The length (number of characters) of a given data type 
will vary as can be seen in Table 6-3. 

6.2 Data Base Term and Character Occurrences 

Phenomena about data bases that affect the way in which 
profiles should be written are the frequency with which specific 
terms, letter combinations and letters occur in the data base 
and the variation between data bases of the occurrences of the 
same term. In order to monitor growth of vocabulary in data 
bases, observe differences between data bases and predict the 
degree of specificity of individual profile terms and truncated 
terms at the time of writing profiles, CSC has prepared a number 
of lists for each data base including: term frequency--sorted 
both alphabetically and in frequency order; KLIC indexes; bigram 
frequency lists; and single-character frequency lists 
(reference 6). 

6.2.1 Term Frequency 

CSC has developed a program for extracting word tokens from 
a data base and sorting them both alphabetically and by frequency. 
These sorted frequency lists are prepared for each data base. 
Figures 6-7 through 6-12 are samples from CA, BA and EI alpha- 
betical term frequency lists and frequency ordered term fre- 
quency lists. 

The program, EXTRACT, extracts word tokens from a data 
base in IITRI format. The types of words (e.g., title words, 
author words) to be extracted are defined by the programmer. 
Initially, the program used only blanks as delimiters. Thus if 
"X RAY" appears 
in the data base without a hyphen and with an internal blank the 
program will extract "X" and "RAY" as two separate words. Although 
the arbitrary choice of blanks as delimiters resulted in some terms 
splitting, it appeared to be a realistic convention, since none 
of the data bases are produced under an explicit set of delimit- 
ing conventions that could be incorporated in EXTRACT. 

[Figure 6-12] 

This delimiter was used for samples of 2, 6, and 13 issues 
of CA Volume 72. However, later analysis showed that a small 
number of discrete terms could be identified by adding slashes 
and asterisks as delimiters and this was done for 13 issues of 
Volume 73. When we prepared the frequency lists for Volume 75 
we stripped off non-alphanumeric characters from the beginning 
and/or ends of words. 

The second program, SQUEEZ, compresses the extracted word 
tokens into a list of word types (unique words), and maintains 
with each word type a count of the number of times that type was 
found. SQUEEZ makes use of the IBM SORT/MERGE utility program 
to sort all extracted words alphabetically. It compares each 
word to the preceding one in the sorted list, and removes dupli- 
cates, counting each time it does so. The alphabetical list is 
printed out, with the count for each word. (See Figures 6-7, 
6-8, and 6-9.) 

At this point a program called CLEAN strips off non- 
alphanumeric initial and terminal characters in order to avoid 
listing such terms as "HEAT," and "HEAT" as separate words. 

The program, FREQDT, is used to print out the unique words 
in decreasing order of their frequency of appearance. Again, 
the SORT/MERGE program is used to sort by frequency count, and 
the words are then printed in a one, two, or four column format. 
(See Figures 6-10, 6-11, and 6-12). 
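
The sequence of programs described above can be sketched in Python as follows. This is an illustration of the processing steps only, not the CSC programs: the function names merely echo the program names, the delimiter is a single blank as in the initial version of EXTRACT, and the sample titles are invented.

```python
import re
from itertools import groupby


def extract(citations):
    """EXTRACT step: split text into word tokens on blanks."""
    return [word for text in citations for word in text.split(" ") if word]


def squeez(tokens):
    """SQUEEZ step: sort the tokens, then collapse adjacent duplicates
    into (word type, count) pairs."""
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted(tokens))]


def clean(counts):
    """CLEAN step: strip leading and trailing non-alphanumeric characters
    and merge the counts of words that become identical."""
    merged = {}
    for word, count in counts:
        stripped = re.sub(r"^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "", word)
        if stripped:
            merged[stripped] = merged.get(stripped, 0) + count
    return sorted(merged.items())


def freqdt(counts):
    """FREQDT step: order word types by decreasing frequency."""
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


TITLES = ["HEAT, TRANSFER IN X RAY TUBES", "HEAT OF X RAY"]

alphabetical = clean(squeez(extract(TITLES)))
by_frequency = freqdt(alphabetical)
```

In the sample, "HEAT," and "HEAT" are separate types after the squeeze step and are merged into a single type with a count of 2 by the cleaning step; "X" and "RAY" remain separate words because only blanks delimit tokens.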

The frequency lists are useful in determining which terms 
are likely to be highly discriminating because of low frequency 
and which terms are likely to have poor retrieval effectiveness 
because of their high frequency. The term frequency lists are 
intended both as user aids and analytical tools. Occasionally
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term sequences occur that seem to convey other information, for 
example, on one page of the list (see Figure 6-13) the terms 
"tobacco" and "cancer" appear in sequence having frequencies of 
348 and 347. On the same page the terms "pregnancy", "chick",
"bed", "critical", and "hormones" are listed sequentially having
frequencies of 373, 372, 371, 371, and 370.

As might be expected, the prepositions and conjunctions are 
of high frequency, but within the twenty-five words of highest 
frequency are also: EFFECT, REVIEW, ACID, ACIDS, DETERMINATION,
CHEMICAL, STRUCTURE, PROPERTIES, IRON, and SYNTHESIS. Some of
these terms can be used for search terms, but should be used 
with care, since they could result in hits on a large portion 
of the file. They should be qualified by incorporation in phrases 
or associated with other terms in the logic expression. 

6.2.2 Type:Token Ratios

After preparing frequency data we analyzed them to determine
the number of occurrences (tokens) of unique terms (types). For
CA, we did a series of these studies, using 2, 6, and 13
issues of Volume 72 and 13 issues of Volume 73, making a total of
26 issues. In this way we could get a curve of type:token ratio
versus tokens. As would be expected, the number of tokens per
type increases with an increased number of citations. Each type
appears, on the average, 5.48 times in 9000 citations taken from
two issues, but 12 times for 134,000 citations taken from 26
issues. A summary is given in Table 6-4. The curve of type:token
ratio versus tokens, plotted on a log scale, is a straight line
(see Figure 6-14).

Although it is probably not reasonable to project this line,
doing so indicates that no new types would be added once the
data base reached 45 million tokens (about 12 years' worth of
CA). Such a limit is unlikely in practice: we know that
approximately 100,000 new compounds (which have names that may
be reported in the literature) are developed each year, and
there are likely to be newly coined words in a growing and
changing technological society.
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Table 6-4 
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Although the absolute number of term types does not increase 
linearly, the rate of increase is constant. 

The type:token ratios for the earlier volumes of CA (72 and
73) differ from those for Volume 75, probably because we stripped
off non-alphanumeric characters via the CLEAN program when we ran
Volume 75 (see Table 6-5).



CA Volumes   No. Issues   No. Types   No. Tokens   Type:Token Ratio

72 & 73          26        153,268    1,841,432        1:12.01
75               26        100,220    2,217,158        1:22.12

Table 6-5
TYPE:TOKEN RATIOS
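
As a quick check on the ratios in Table 6-5, the arithmetic can be reproduced directly from the type and token counts. The sketch below uses only the figures from the table; the function name is an illustrative choice.

```python
def tokens_per_type(n_types, n_tokens):
    """Average number of occurrences (tokens) per unique word (type)."""
    return n_tokens / n_types

# (number of types, number of tokens) as given in Table 6-5
table_6_5 = {
    "CA Volumes 72 & 73": (153_268, 1_841_432),
    "CA Volume 75": (100_220, 2_217_158),
}

for label, (n_types, n_tokens) in table_6_5.items():
    print(f"{label}: 1:{tokens_per_type(n_types, n_tokens):.2f}")
```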



6.2.3 Key-Letter-in-Context Listings

The data base analysis programs discussed above (EXTRACT,
SQUEEZ, CLEAN, and FREQDT) are followed by a fifth program, KLICPT,
which generates a Key-Letter-In-Context (KLIC) index. A KLIC
is a permuted word listing sorted on each letter in each word
in the data base with the remainder of the word wrapped around
it (similar to a KWIC index). The KLIC index is printed with
the term frequency following each term. Figures 6-15, 6-16, and 6-17
are sample pages from CA, BA, and EI KLICs.

The KLIC for CA Volume 75 contained 26 issues from July 1,
1971 through December 31, 1971 and contained 157,995 citations.
The EXTRACT program extracted 2,217,158 words from the title
and keyword fields. Of these, 73,470 contained non-alphanumeric
initial and terminal characters that were stripped off by the
CLEAN program. The SQUEEZ program selected 100,220 unique words.
The 1,097,512 KLIC index entries were sorted on the 12th position
of a 20 position field and KLICPT was run to write the KLIC for
printing.
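
The permutation behind a KLIC listing can be illustrated with a simplified sketch. It is not KLICPT: it assumes that each word is entered once per character, sorted on the key character and the characters that follow it, and printed with the key character in a fixed column. The wrap-around of long words and the exact field widths used by CSC are omitted, and the function name and sample counts are hypothetical.

```python
def klic_entries(word_counts, left_width=10):
    """Build simplified KLIC-style lines, one per character of each word.

    The key character is aligned at column `left_width` (0-based); words
    with a longer prefix than `left_width` would break the alignment.
    """
    entries = []
    for word, count in word_counts.items():
        for i in range(len(word)):
            # Sort on the key character and everything after it.
            entries.append((word[i:], word, i, count))
    entries.sort()
    lines = []
    for _, word, i, count in entries:
        lines.append(f"{word[:i]:>{left_width}}{word[i:]}  {count}")
    return lines

for line in klic_entries({"ACID": 3, "ACIDS": 1}):
    print(line)
```

Reading down the key column then shows every word that contains a given letter sequence, which is what makes the listing useful for judging left truncation.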
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Figure 6-17 





EI KEY-LETTER-IN-CONTEXT INDEX



The EI Frequency Lists and KLIC generated from EI COMPENDEX
Volume 71, issues 1, 2, 3, 6, 7, 8, 9, 10, 11, and 12 contained
1,096,994 words taken from titles and index terms. Of these,
115,669 were stripped of terminal punctuation. Redundant words
were removed, yielding 54,914 unique words for a Type:Token ratio
of 1:20. The number of KLIC index entries was 515,317. Following
the KLIC, the EI bigram frequencies were prepared.

6.2.4 Bigram Frequencies 

The CSC search system employs a Least Common Bigram (LCB)
technique (Section 5.6). LCBs depend on a bigram (2 letter
combination) frequency list which is prepared following the KLIC
index. An alphabetical list of bigrams (with frequencies) is
prepared as the last step in KLICPT (see Figure 6-18). A small
program, PRTLCB (Print LCBs), was written to print out bigrams
in 4 column order. One column is printed for each of four bigram
files (BA, Volume 73 CA, EI, and Volume 75 CA). If SORT/MERGE is
run before PRTLCB, the listing is generated in decreasing
frequency order (see Figure 6-19).

Bigram frequency lists are prepared for each data base.
When printed in frequency order they can be looked at as LCB
lists. Many of the LCBs for CA, BA, and EI rank as low frequency
bigrams in each data base, but their positions in terms of
frequency differ a bit, as can be seen in Table 6-6, where bigram
frequencies for CA Volume 73, EI Volume 71, and BA Volume 52 were
compared with CA Volume 75.
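
A bigram frequency list, and the choice of a least common bigram for a term, can be sketched as follows. This is an assumption-laden illustration: the LCB technique itself is described in Section 5.6, and here it is assumed only that the LCB of a term is the bigram within it that has the lowest frequency in the data base. The word list is invented and the function names are hypothetical.

```python
from collections import Counter

def bigram_counts(words):
    """Count every two-letter combination over a list of word tokens."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts

def least_common_bigram(term, counts):
    """Return the bigram of `term` (length >= 2) with the lowest count.

    Ties are resolved in favor of the leftmost bigram.
    """
    bigrams = [term[i:i + 2] for i in range(len(term) - 1)]
    return min(bigrams, key=lambda b: counts.get(b, 0))

words = ["MOLYBDENUM", "MOLECULE", "DENSITY", "NUMBER"]
counts = bigram_counts(words)
print(least_common_bigram("MOLYBDENUM", counts))
```

Because the counts depend on the data base, the same term can have a different least common bigram in CA, BA, and EI, which is the variation Table 6-6 reports.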

6.2.5 Single Character Frequencies 

Another small program, CHRCNT (Character Count), is used
to generate a listing of single-character frequencies. The normal
listing is alphabetical, but again, if SORT/MERGE is used as a
prefatory step, the output can be obtained in frequency order.

The single characters, in decreasing order of frequency of
occurrence in CA, BA, and EI, are:

• for BA (based on Vol. 52) 
tfOEAlOTNlSR52CL6MHDU4PF38GY7BVW9XKZQJ.$(+; )*=' ?: ,/- 

• for EI (based on Vol. 71)

EITtfNARSOLCUMDPGHFYBVWKXZQOJ , 192 '635748) ( ; . += ?:/-*$ 

Figure 6-18

BIGRAM FREQUENCY LISTING (columns: NO., BA 52, CA 73, EI 71, CA 75)
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Figure 6-19 

FREQUENCY-ORDERED BIGRAM LISTS 
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Table 6-6 

DATA BASE BIGRAM COMPARISON 
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• for CA (based on Vol. 73) 

EIOANTSRLCDMHPUYFGBV.X ,K-WZ ) ;QJ123 ( '450978+/$=*? : 

• for CA (based on Vol. 75) 

EIONtfTARSLCDMUHPYFGBVXZWKQ21 ,34(0. 56J79)8 ' + :=?;*$/- 

Just as the Least Common Bigram affects search time (Section
5.6), so does the individual character frequency, though not to
so great a degree. The SEARCH program executes the built-in 
function INDEX over a million times in an average run, and the 
time required for this execution is dependent upon the relative 
positions of characters in the look-up string. For maximum 
efficiency, these characters should be ordered in decreasing 
frequency order of single characters in the data base. 
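
The effect of ordering the look-up string by character frequency can be illustrated with a small sketch. It assumes a simple left-to-right scan, so that the cost of finding a character is its position in the string; the actual behavior of the INDEX built-in on the CSC machine is not modeled. The sample text and function names are hypothetical.

```python
from collections import Counter

def lookup_string(text):
    """Order the distinct characters of `text` by decreasing frequency."""
    counts = Counter(text)
    return "".join(sorted(counts, key=lambda c: (-counts[c], c)))

def mean_scan_length(text, lookup):
    """Average number of positions examined to find each character of
    `text` when `lookup` is scanned from the left."""
    position = {c: i + 1 for i, c in enumerate(lookup)}
    return sum(position[c] for c in text) / len(text)

sample = "EFFECT OF HEAT ON ACID SYNTHESIS"
by_frequency = lookup_string(sample)
alphabetical = "".join(sorted(set(sample)))

print(mean_scan_length(sample, by_frequency))
print(mean_scan_length(sample, alphabetical))
```

Under this assumption the frequency-ordered string never gives a longer average scan than any other ordering of the same characters.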

6.3 Data Base Terminology Variation 

One of the problems associated with profile preparation
is the use of identical terms in different data bases.
Technically, a profile can be run against multiple data bases and
will cause hits only in the data bases where the terms occur.
Although it can be (and is) done, it is not the best method:
the same term in multiple data bases can have different meanings
or provide a different degree of specificity because of the
nature of the file. For example, the term ACID as used in a
chemical data base, an engineering data base, and a sociological
data base would function differently. In Chemical Abstracts
(Figure 6-20) it would be a non-specific term of high frequency
(11,868 occurrences in 1/2 year) that would have to be AND'd to
other terms. In Engineering Index the term ACID would be a
reasonably specific low frequency term (253 occurrences in 1/2
year; see Figure 6-21) that might even stand alone as a search
term. In a sociological data base the term ACID would probably
refer to LSD.

Another set of examples of variation in terminology among
data bases is given in Figure 6-22 where we see, for example,
that proper names, compounds, formulas, isotopes, and Greek
letters are represented differently in CA, BA, and EI.
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*ACID* APPEARANCES IN CA 
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*ACID* APPEARANCES IN El 



Data Base    Term Representation

             John Q. Public Jr. (proper name)
CA           PUBLIC JOHN QUINCY, JR
CA           PUBLIC JOHN Q, JR
CA           PUBLIC J Q, JR
BA           PUBLIC JQ
EI           PUBLIC, JR JQ

             Lipoprotein (compound type)
CA           LIPOPROTEIN
BA           LIPO PROTEIN

             New York (city)
CA           NEW YORK
BA           NEW-YORK

             Sulfuric Acid (H2SO4) (formula)
CA           H2SO4
EI           H//2 SO//4

             Carbon 12 (isotope)
CA           CARBON-12-LABELLED
CA           C-12-LABELLED
CA           CARBON 12
CA           C 12
EI           **1**2C
EI           CARBON 12

             Alpha (Greek letter)
CA           .ALPHA.
BA           ALPHA

Figure 6-22

VARIATION IN TERM REPRESENTATION
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USER AIDS 



In order to assist the user in writing and monitoring
his profile, including selection and truncation of terms,
CSC has prepared a number of user aids in the form of
documents, computer generated lists, and output card information.

The CSC Search Manual explains the basic techniques of 
profile writing. A Supplemental Guide has been written for 
each data base. The guide demonstrates profile writing 
tailored to the specific data base. A Truncation Guide 
illustrates where to truncate a term in order to retrieve 
the maximum relevant words with the minimum noise. For 
example, Figure 7-2 from the Truncation Guide demonstrates 
the retrieval ability of various forms of terms related to 
the concept "analysis." 

Frequency Lists in Frequency Order and Frequency Lists
in Alphabetic Order are prepared for each data base (see
Figures 6-10 and 6-7). These lists are used as rough
indicators of the volume of output one might expect to receive
for specific terms. They are prepared for one volume at a
time for each data base and are updated periodically.

Key-Letter-in-Context (KLIC) indexes are prepared for
each data base. The KLIC indexes indicate where letter
combinations occur. They are used in conjunction with our Bigram
Frequency lists which provide a frequency count for every
two-letter combination (bigram) in the data base.

As further aids to users in monitoring their output, 
Index Terms and Hit Terms are printed on each output card 
(see Figure 3-2) to provide the user with information for 
revising his profiles; Search Term Frequency/Issue listings 
are generated for each profile to show the user the frequency 
of occurrence in the issue searched for each term in his
profile.

7.1 Search Manual and Supplemental Guide

In preparation for user education, workshops, and 
training seminars, IITRI developed a Search Manual. The
manual was designed to assist CSC users in developing
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individualized search profiles for use with the CSC system.

In preparing a profile the user prepares the detailed
specifications he requires for retrieving citations from a data
base. The manual explains the problems and techniques
associated with development of search profiles. Problem
areas include: the inflexibility of machinable data bases;
the variety of word forms (grammatic, semantic, syntactic,
and generic); the variety of conventions employed for
abbreviations, symbols, and acronyms; the varied practices,
degrees of specification, and presence or absence of controls
employed in indexing and classification; and the variety of
nomenclature used within and among data bases.

The special techniques of profile preparation are:
determination of search terms--including synonyms, higher
and lower generic terms, and related terms; determination
of searchable entries other than subject terms, such as 
authors; the use of left and right truncation for retrieval 
based on term fragments and distinctive letter combinations; 
the use of links for grouping of related terms within a 
logic expression; development of free-form logic expressions 
employing the Boolean operators AND, OR, and NOT; and the 
assignment of weights to profile terms in accordance with 
relative importance of terms to the user. 

A Supplemental Guide has been prepared for each data 
base searched. The guide provides information about the 
use of data elements that are specific to the particular 
data base and demonstrates profile writing techniques for 
that data base. 

7.2 Key-Letter-in-Context (KLIC) Indexes 

Key-Letter-in-Context (KLIC) indexes* are prepared for 
each data base to assist users in selecting term fragments. 
The KLIC index is prepared from title and keyword terms 



*We are indebted to Dr. Anthony Kent of the University of 
Nottingham for the concept of the KLIC index and the 
insight into its utility. 
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appearing in the data base. A KLIC index is similar to 
a Key-Word-in-Context (KWIC) index but is confined to a 
single term and alphabetizes the term separately under 
each of its constituent characters indicating preceding and 
following characters as they are wrapped around the
distinguishing character. The KLIC index is a lexicographic
ordering of terms in a data base by each character (alpha,
numeric, or special) in the term or character string. It
is a permuted term arrangement sorted by character. The
format of a KLIC index is shown in Figure 6-15.

A user can see what the potential retrieval may be from
using a term fragment in any of the truncation modes. The
KLIC index is especially helpful in selecting fragments
with left truncation or both left and right truncation.

A program, KLICPT, prints the KLIC index of all the
words in a four column format with the frequency of the term
following the term as shown in Figure 6-15. Delimiters for
terms in the KLIC index are asterisks, slashes, and blanks.
Each entry in the index appears in a 21-character line, with
the eleventh character as the sort character. A double
slash (//) is used as a word delimiter, and the words are
wrapped around the central sort character. The KLIC index
is used for linguistic research and as a user aid. By
consulting the KLIC index one can determine the retrieval
capability of a particular letter combination or term
fragment. The KLIC index is used to identify letter combinations
that are highly specific and would therefore be discriminating
search terms, e.g., the character string *YBD* does
not occur anywhere in the CA or BA data bases except in the
term MOLYBDENUM (Note: in a literary data base it would also
occur in CHARYBDIS, of the mythological pair SCYLLA and
CHARYBDIS). Thus, *YBD* could be used as a search term for
molybdenum.

On the other hand, letter combinations that occur frequently
in many irrelevant terms should be avoided, e.g., the letters
RNA for ribonucleic acid could be used as a search term 
assuming one did not specify simultaneous left and right 
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truncation *RNA*. The simultaneous truncation mode would 
retrieve more than 200 irrelevant words. Some of these are: 
ALTERNATE 
BARNACLE 
CARNATION 
DIURNAL 
FINGERNAIL 
MATERNAL 

From Figure 6-15 it can be seen that a user who employs
the search term *ACID might expect different terms to be
retrieved. *ACID* might retrieve terms that include
several cases of singular and plural word forms. The
KLIC Index can only be used as a general guide, as the terms
appearing in any given issue of a data base will not necessarily
correspond with the list appearing in the index.

Also, the term fragment occurrences for a given term
in different data bases will differ; hence searching on the
same truncated word in different data bases will retrieve
different terms.
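
The difference between the truncation modes discussed in this section can be made concrete with a short sketch. The asterisk convention follows the examples in the text; the function name and the small vocabulary are hypothetical, and MRNA is an invented entry added only to show left truncation.

```python
def matches(pattern, word):
    """Test a word against a search term with optional truncation.

    A leading asterisk means left truncation, a trailing asterisk
    means right truncation, and both together mean the fragment may
    occur anywhere in the word.
    """
    left = pattern.startswith("*")
    right = pattern.endswith("*")
    core = pattern.strip("*")
    if left and right:
        return core in word
    if left:
        return word.endswith(core)
    if right:
        return word.startswith(core)
    return word == core

vocabulary = ["RNA", "MRNA", "ALTERNATE", "BARNACLE", "MATERNAL", "MOLYBDENUM"]

print([w for w in vocabulary if matches("*RNA*", w)])
print([w for w in vocabulary if matches("*RNA", w)])
print([w for w in vocabulary if matches("*YBD*", w)])
```

Running the same pattern against the word lists of two data bases shows directly how the retrieved sets differ.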

7.3 Term Frequency Lists

Frequency Lists in Frequency Order and Frequency Lists
in Alphabetic Order have been prepared for each data base.
They are used to assist in selection of search terms.
Figure 6-10 shows a portion of a frequency-ordered term
frequency list and Figure 6-7 shows a portion of the
alphabetically-ordered term frequency list. A high frequency
term will produce a high volume of hits unless it is combined 
with another search term or assigned a low weight. For this 
reason we have instituted an automatic check to notify us 
if profiles contain any of the 50 highest frequency terms 
in a given data base. If any of these terms are used they 
must be AND'd to other terms, assigned a low weight, or 
otherwise restricted within a profile. A low frequency word 
might be used independently. Frequency lists are used as 
rough indicators of the volume of output one might expect 
to receive for specific terms. Our frequency lists have been 
prepared for one volume of each data base. 
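
The automatic check for high frequency terms might look like the sketch below. It is an illustration only: the function name is hypothetical, the ACID count is the CA half-year figure quoted in Section 6.3, and the other counts are invented for the example.

```python
def high_frequency_terms(profile_terms, frequency_list, top_n=50):
    """Return the profile terms that are among the `top_n` most
    frequent terms of a data base frequency list."""
    ranked = sorted(frequency_list.items(), key=lambda kv: (-kv[1], kv[0]))
    top = {word for word, _ in ranked[:top_n]}
    return [term for term in profile_terms if term in top]

# ACID count from Section 6.3; the remaining counts are invented.
frequency_list = {"ACID": 11868, "EFFECT": 9000, "MOLYBDENUM": 120, "CHARYBDIS": 1}
profile = ["ACID", "MOLYBDENUM"]

# top_n is reduced to 2 here only because the sample list is tiny.
print(high_frequency_terms(profile, frequency_list, top_n=2))
```

A flagged term is not rejected; as stated above, it must be AND'd to other terms, given a low weight, or otherwise restricted.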
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7.4 Truncation Guide and Standard Truncation

Truncation is used to facilitate retrieval of terms
containing fragments that are common to two or more
different forms of a term. The use of a single fragment will
retrieve all terms containing that fragment in accordance
with one of the truncation modes as described in Section 4.2.2.

Individual users were initially allowed to use search
terms and truncations in a free and uncontrolled manner.
A study of the resulting aggregated term list revealed
numerous sets of related term fragments. Examples of related
fragment sets are shown in Figure 7-1.

For commonly used terms a preferred truncation can be 
selected that will meet three conditions: 

(1) The truncated term is a fragment common to a
    set of desired words associated with that
    fragment.

(2) The fragment is unique and its use will not
    retrieve other terms outside of the associated set.

(3) The fragment is the shortest representation that
    preserves uniqueness.



PREPARE           PURIF*            SYNTH*
PREPARAT*         PURIFIC*          SYNTHE*
PREPARATION       PURIFICAT*        SYNTHES*
PREPARATION*      PURIFICATION      SYNTHESI*
                  PURIFY            SYNTHESIS
------------      ------------      ---------
PREPAR*           PURIF*            SYNTHE*

*denotes truncation; fragment below line in each set is the
preferred truncation.

Figure 7-1

SETS OF RELATED FRAGMENTS WITH UNCONTROLLED TRUNCATION



Inasmuch as CSC search time is directly proportional
to the number of profile terms in a run, we can reduce search
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time by establishing standard truncations for concepts used 
by several users. In one of our early runs we found that 
in a number of instances the variety of truncation forms 
used was significant. Accordingly, a Truncation Guide 
has been prepared that lists more than 600 fragments of 
which 151 have been recommended for use. The truncations 
listed in the Guide are all right truncations (mode 2) 
and were selected from a list of common terms that have 
been employed by users of CSC. Each term has been placed
in alphabetical order within a set of words associated with
varying length term fragments. The words for the
alphabetical listings were obtained from Chemical Abstracts
Service Search Guide; The Condensed Chemical Dictionary,
5th edition; Webster's Seventh New Collegiate Dictionary;
and Chemical Abstracts Index. A page from the Truncation
Guide is shown in Figure 7-2.

At the top of each listing appears a set of candidate 
truncations or term fragments. The brackets in each column 
identify the terms in the alphabetical list that would be 
retrieved by the use of the designated fragment. A term 
fragment is considered optimal if it satisfies all three 
of the conditions stated above. Other fragments may provide
either over-truncation or under-truncation. In over-truncation,
the fragment is too short and an overlapping of more than 
one set of associated words occurs leading to the retrieval 
of non-relevant terms. In under-truncation, the fragment is 
too long and a loss of relevant terms may occur due to the 
excessive restriction on the set of terms that can be retrieved. 
In some cases, several fragments of varying lengths will 
retrieve the same set of terms. The shortest fragment is 
then selected as optimal. 
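
The three conditions for an optimal fragment can be expressed as a small search over right truncations. The sketch below is a hedged illustration, not the procedure used to compile the Truncation Guide: the function names and the vocabulary are hypothetical, and only right truncation is considered.

```python
def retrieved(fragment, vocabulary):
    """Words retrieved by right truncation on `fragment`."""
    return {w for w in vocabulary if w.startswith(fragment)}

def optimal_truncation(desired, vocabulary):
    """Return the shortest right-truncated fragment that retrieves
    exactly the desired set of words, or None if no fragment does.

    Any fragment common to all desired words is a prefix of the
    shortest one, so only those prefixes are tried, shortest first.
    """
    desired = set(desired)
    shortest = min(desired, key=len)
    for n in range(1, len(shortest) + 1):
        fragment = shortest[:n]
        if retrieved(fragment, vocabulary) == desired:
            return fragment + "*"
    return None

# Invented word list for the concept "analysis".
vocabulary = ["ANALGESIC", "ANALOG", "ANALYSIS", "ANALYTICAL", "ANALYZE"]
desired = ["ANALYSIS", "ANALYTICAL", "ANALYZE"]

print(optimal_truncation(desired, vocabulary))
```

In this word list ANAL* is over-truncated, since it also retrieves ANALGESIC and ANALOG, while ANALYS* is under-truncated, since it loses ANALYTICAL and ANALYZE.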

While the CSC Truncation Guide is helpful, one can achieve 
many of the same objectives by using a handbook, dictionary, 
or other list of terms. They can be used: 
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Figure 7-2 

TRUNCATION GUIDE ENTRIES FOR THE CONCEPT "ANALYSIS"



(1) To obtain an estimate of the number of discrete
    terms and type of terms that may be retrieved
    by using right truncation with a given term
    fragment.

(2) To balance selection between a longer and shorter
    term fragment.

(3) To indicate optimal term fragments that are the
    shortest, unique fragment capable of retrieving
    a set of associated words.

(4) To designate fragments for use with terms where a
    seemingly optimal truncation may be ambiguous or
    lead to false retrieval.

Standard truncations as indicated in the Truncation
Guide are used not only because they improve retrieval
effectiveness--they also provide a cost savings to CSC as their
use increases the aggregation ratio for profile input terms.
A check of several groups of profiles indicated that 10% of
the terms could employ standard truncations. However, it is
not always possible to use the optimal truncated form of a
word as a standard form. In some cases, the data base may
contain abbreviations that are different from the optimal
truncation forms of a full word and the abbreviation must be
used to ensure retrieval. For example, in the case of the
concept ANALYSIS, CA uses the abbreviation ANAL for the group
of words which we have determined to be best found with the
optimal truncation form ANALY*. Both terms should be used,
ANALY* to retrieve from text, and ANAL (no truncation) to
retrieve from the CA keyword list. In other cases,
truncated words may retrieve too many false drops. For example,
*AMIN* will retrieve various amines, but it will also pick
up words such as CONTAMINATION. In another case one user
may be interested in crystals and all forms of the term.
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He would use the truncated term CRYSTAL*. Another user 
may wish only the process crystallization and not everything
on crystals or crystallography. Such a user would use
CRYSTALLIZ* rather than CRYSTAL*. The profile preparer
cannot blindly select truncation forms from the Truncation
Guide or other aids. Each selection must be made in full
understanding of the profile and the data base.
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8. USER EVALUATION AND FEEDBACK 



For purposes of assessing the degree of user satisfaction
and usefulness of the SDI, system users have been requested
to evaluate the relevance of retrieved citations. A rele- 
vant citation is defined as one that is judged by the user 
to satisfy the intent of his profile. Although the output 
citation necessarily satisfied the search terms, logic, and 
associated parameters of the profile inasmuch as they were 
the keys that retrieved the citation, the intent of the 
profile may not be necessarily satisfied. It is the user's 
judgment that is required to discern the discrepancy between 
intent and output and modify his profile accordingly. 

Evaluating or attempting to measure the performance, 
effectiveness, and utility of an information retrieval system 
is difficult for a variety of reasons, not the least of which
is that the user's interests may change over a period of
time. An article that is of interest today may not be of 
great interest to him a month hence and vice versa. Because 
user interests and profiles change, we have requested that 
users evaluate citations at, or as close as possible to, 
the time of receipt from the Center. 

An evaluation form (shown in Figure 8-1) is sent to each 
user with every issue of output. The evaluation form indi- 
cates the number of citations sent for the particular SDI 
run and asks the user how many of these were of interest, 
and of no interest. 

Using the values returned on the evaluation reports,
the CSC calculates the percent relevance (precision) of
output for each user. Data are accumulated by user, by
company, and by issue. When these reports indicate that the
user is receiving too much extraneous material or very little
pertinent output, the CSC personnel consult the individual
user with suggestions and assistance if modifications are
needed.

If a user's precision rating runs high or low over several
runs this usually indicates a problem. If he gets 90%-100%
precision he is probably missing relevant citations by using
terms that are highly specific or logic that is overly
tight. If he gets precision ratings below 25%, he is getting
too many non-relevant citations and this is probably due to
the use of high frequency or common terms in an unrestrictive
manner.

A high percentage of the forms are returned (see Table 8-1)
which indicates that the users are looking at their output
and checking it--at least to the extent of putting the cards
in the two fill-in groups: relevant and non-relevant. In
general, 50 percent of the forms are returned to IITRI
within two weeks of our mailing. The balance of the forms
are returned anywhere from three weeks to ten months after
the mailing.

Precision was calculated as the number of citations 
considered to be of interest by the user divided by the 
number of citations sent to the user as indicated on the 
returned evaluation reports. The statistics also do not 
consider the fact that when no citations were located, zero 
output might well represent real information and in effect be 
100 percent satisfactory to the user. 
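
The precision calculation and the screening thresholds mentioned earlier in this section can be sketched as follows. The function names and sample figures are hypothetical; the 90 percent and 25 percent limits are the ones quoted above, although CSC applies them to ratings sustained over several runs rather than to a single report. A run with no citations sent is left out of the average, in line with the remark about zero output.

```python
def precision(of_interest, sent):
    """Percent relevance: citations of interest over citations sent.
    Returns None when no citations were sent."""
    if sent == 0:
        return None
    return 100.0 * of_interest / sent

def screen(percent):
    """Rough flag for a single precision value."""
    if percent is None:
        return "no output"
    if percent >= 90:
        return "possibly too narrow"
    if percent < 25:
        return "possibly too broad"
    return "ok"

def weekly_average(values):
    """Average of the individual users' percent relevance,
    ignoring users who received no citations."""
    rated = [v for v in values if v is not None]
    return sum(rated) / len(rated) if rated else None

# Invented evaluation reports: (citations of interest, citations sent)
reports = [(3, 30), (19, 20), (12, 40), (0, 0)]
values = [precision(hit, sent) for hit, sent in reports]
print([screen(v) for v in values])
print(weekly_average(values))
```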

Precision statistics are presented in Table 8-2. The
statistics were obtained from 131 searches run on CA
Condensates from Volume 71, issue 9 through Volume 76, issue 12.

Table 8-2 lists average precision ratings of retrieved 
citations by search run. These numbers are affected by the 
content of CA Condensates. Content varies from week to week 
since not all journals are abstracted in every weekly issue. 

Some profiles would therefore be low in citations of interest 
retrieved in a week when the journals in the area of their 
interest are not abstracted. The figures for the 131 weeks 
listed in Table 8-2 vary from a low of 19.0 percent to a 
top figure of 46.5 percent with an average weekly relevance 
of 30.0 percent. The weekly average was calculated by 
averaging the percent relevance of the individual users. The 
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IIT RESEARCH INSTITUTE
COMPUTER SEARCH CENTER
10 WEST 35 STREET
CHICAGO, ILLINOIS 60616
PHONE: 312/225-9630

EVALUATION REPORT

Date Sent
Name
Profile Number
Service: Chemical Abstracts      Series: Condensates
Volume        Issue        Date of Search

Number of citations received
Number of citations considered to be of interest
Number of citations considered to be of no interest

CSC Comments:

User Comments:

Figure 8-1

USER EVALUATION REPORT FORM
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CHEMICAL ABSTRACTS CONDENSATES 
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figures vary not only because of the availability of material
on the particular question but also because of the attitude
of the user toward modifications of his profile. In many
cases, modifications have been made by CSC personnel and the
users cooperatively. In other cases, users have taken
the initiative in modifying their own profiles. But there are
cases where the user has not wished to modify his profile
and in these cases, it is possible that citations that could
be pertinent are being missed or too much irrelevant material
is being produced. This situation does cause some low
relevance ratings for individual users.

The distribution of the average profile precision for 
profiles that were searched in all of the 131 runs of CA 
Condensates Volume 71, issue 9 through Volume 76, issue 12 is 
shown in Table 8-3. Average precision for 12 issues CA Volume
76 and 13 issues CA Volume 71 ranged from 0 to 100 percent. 
More than 50 percent had greater than 50 percent relevance. 



                     CA Volume 76        CA Volume 71
                     (12 issues)         (15 issues)
Percent Relevance    Percent Profiles    Percent Profiles

      0                   16.4                28.8
    1-10                  14.9                 4.9
   11-20                  13.4                 7.2
   21-30                   8.8                 9.2
   31-40                   8.9                 8.1
   41-50                   9.3                15.6
   51-60                   5.0                 4.7
   61-70                   5.9                 4.9
   71-80                   3.7                 4.3
   81-90                   3.6                 1.8
   91-100                 10.1                14.9

Table 8-3

DISTRIBUTION OF AVERAGE PROFILE PRECISION
In addition to the evaluation forms for monitoring pre-
cision ratings, CSC requests users to send back the trailer
card (see Figure 3-3) from their output after circling the
citation numbers for the relevant citations. In this way the
CSC profile coordinator can see exactly which citations are
considered to be of interest. This helps her to understand
the user's interest so that she can suggest more meaningful
profile changes.

In addition to precision data obtained on a weekly basis 
throughout the program, CSC carried out a study to obtain more 
detailed information and evaluations from its users. In mid- 
June 1970 a questionnaire, User Evaluation of Current Awareness 
Service for Chemical Abstracts Condensates, was sent to all 
current users. 

Table 8-4 is a summary of the responses of 51 users of the CSC
SDI system searching CA Condensates for 71 profiles.
                                               OTHER
QUESTION                           IIT/IITRI  ACADEMIC   IND.   TOTAL

1   CA available            yes        9         11       42      62
                            no                    -

2   Prior manual            yes        4          6       23      33
    search                  no         5          5       18      28

3   Monitor searches        yes        5          4       23      32
                            no         4          6       18      28

4   Dispense with           yes        6          3       22      31
    manual searches         no         1          7       17      25

5a  Card format             yes        8          8       37      53
    satisfactory            no         1                   3       4


Not 



5c Terms causing Useful 

„ . 



hits 

6 Maintain card 
file 



Not 



8 

1 



7 

2 



7 

1 



Table 8-4 

— SUMMARY 

USER EVALUATION OF CA CONDENSATES 
CURRENT AWARENESS SERVICE 
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36 

6 



33 

9 



52 

8 



48 

11 




9. EDUCATION— USER LIAISON 

One problem facing information centers is that of education. 
The machine-readable sources of information are not familiar 
to the average working scientist. In order to familiarize the 
potential users of information centers with the new sources 
and services, we at IITRI have undertaken a number of education- 
al activities. Education must pave the way for marketing. 

The means we have undertaken include: development and/or
conduct of workshops, seminars, university courses, short courses,
workbooks, technical presentations, publications, and mass
mailings. Once a user has entered one or more profiles in our
system, it is necessary to maintain liaison with him for
modification of his profile as changes occur in the data bases
and/or his interests. Both aspects of center-user interaction
are described in this section, that of basic education in the
utility of SDI services from machine-readable data bases and
that of continuing liaison while servicing a profile.

The educational aspects of our workshops, seminars, etc.,
are devoted to providing basic information on machine-readable
data bases and their use, in terms of data base contents and
limitations, machine search capabilities and limitations, and the
advantages of mechanized SDI service. There are many advantages
to using SDI services of information centers. Our system was
designed to provide many advantages and through the past three
and a half years of operating experience we have both become
aware of more advantages and gained considerable data to sub-
stantiate our original assumptions. The most obvious reasons
for using SDI services include: (1) coverage, (2) thoroughness
of search, (3) consistency of search, (4) interdisciplinariness,
(5) recall, (6) cost-effectiveness, (7) speed and regularity,
(8) timeliness, (9) multiplicity of data bases, (10) automatic
preparation of files in standardized format, and (11) cost of
data base preparation and operation of an SDI system vs. sub-
scriptions. Further details on these eleven items are presented
in a paper entitled "Handling of Varied Data Bases in an
Information Center Environment" published in the Proceedings
of the Conference on Computers in Chemical Education and
Research, Northern Illinois University, DeKalb, Illinois,
July 19-23, 1971. 3

9.1 Workshops on Computer Retrieval of Scientific Information

We have conducted four workshops for industrial and aca-
demic participants in the use of computer techniques for re-
trieval of scientific information. They were held on January
19-21, 1971; May 3-7, 1971; December 1-3, 1971; and April 19-21,
1972. Another is planned for November or December of 1972.
Each of these Workshops consists of an intensive 2½-day program
of lectures and "hands-on" use of the CSC's SDI service.

Figures 9-1 and 9-2 show the front and back of the Workshop
announcement brochure that is mailed to prospective participants.
CSC staff members give lectures on: CSC philosophy and opera-
tions; techniques for preparing search profiles including use
of data elements, truncation, links, logic, and weights; the
characteristics of data bases; use of aids such as frequency
lists, KLIC (Key-Letter-in-Context) indexes, and truncation
guides; theory of retrieval evaluation including recall, pre-
cision and feedback; and modification of search questions.

Attendees write profiles to reflect their areas of interest.
Profiles are run against representative issues of CA Condensates,
BA Previews, and/or EI COMPENDEX. Following the first run
attendees conduct manual searches of the appropriate hard copies
of CA, BA, and/or EI to compare the results of the machine
search against manual searches. Profiles are then evaluated
and modified and submitted for a second machine search against
the same data bases. Output from the second run is also evalua-
ted by attendees.

Figure 9-3 presents data on recall and precision taken
from the CA searches made by participants of the third Workshop.
Since both manual and machine searches are made, it is possible
to calculate both recall and precision. The increase in both of
these indicators after profile revision has been observed in all
Workshops.
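The recall and precision columns of Figure 9-3 can be reproduced from the citation counts in the same row. The sketch below assumes, from the column headings, that recall is the number of relevant machine-retrieved citations divided by the total number of relevant citations known from the manual and machine searches together, and that precision is the number of relevant machine-retrieved citations divided by all machine-retrieved citations. The function name is hypothetical.

```python
def recall_and_precision(machine_retrieved, machine_relevant, total_relevant):
    """Return (recall, precision) in whole percent, or None where undefined.

    None corresponds to the "N/A" entries of Figure 9-3 (zero denominator).
    """
    recall = None
    precision = None
    if total_relevant > 0:
        recall = round(100 * machine_relevant / total_relevant)
    if machine_retrieved > 0:
        precision = round(100 * machine_relevant / machine_retrieved)
    return recall, precision


# Profile 002-1, December 1 run: 3 machine citations retrieved, 2 of them
# relevant, 7 relevant citations known in total.
print(recall_and_precision(3, 2, 7))
```

For profile 002-1 this gives a recall of 29 percent and a precision of 67 percent, in agreement with the figure.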
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Figure 9-1 

WORKSHOP ANNOUNCEMENT BROCHURE — FRONT SIDE 
December 1, 1971

            MANUAL   -------- MACHINE --------
Profile     Cits.    Cits.    Rele-    Total     Re-      Pre-
Number      Ret'd    Ret'd    vant     Rel.      call     cision

001-1          5        9        3        5       60        75
001-2          0        0        0        0      N/A       N/A
001-3          2        0        0        2        0       N/A
002-1          6        3        2        7       29        67
002-2          8       14       11       11      100        79
003-1          8        6        4        8       50        67
004-1          4       13        4        4      100        30
005-1          0        0        0        0        0         0
005-2          4        6        2        6       67        33
006-1          1        1        1        1      100       100
006-2          0        2        0        0      N/A         0
007-1         10        3        1       10       10        33
008-1
009-1          2        1        1        2       50       100
009-2          3        3        3        3      100       100
010-1          1       13        1        1      100         8
010-2          1        7        1        1      100        13
011-1         14       21       11       16       69        52
012-1         74       25        7       80        9        28

Average                                           59%       49%


December 2, 1971

001-1          5        8        7        7      100        88
001-2          2        2        2        2      100       100
002-1          8        8        5       11       73        62
002-2         14       22       18       18      100        82
003-1
004-1          4        5        4        4      100        80
005-2          4        6        3        7       43        50
005-3         14       16       10       19       53        63
006-3         14       14        9       17       53        64
007-1
007-2          3       12       11       13       84        91
008-2
009-3          1        3        1        1      100        33
010-1          1        6        1        1      100        17
010-2          1        4        1        1      100        25
010-3          7        9        9        9       56       100
011-1         16       12        9       16       56        75
011-2          2        1        1        2       50       100
011-3         16       15        6        6      100        40
012-1
013-1
013-2

Average                                           79%       67%

Figure 9-3
We have found the 2½-day period with opportunity for two
manual and two machine searches to be a good format for presenta-
tion of this material. We have limited attendance to 20 to 30 people
since we have found that individual instruction yields the best
results. Although the workshops have proved to be a good source
of continuing subscribers, they have frequently been attended
by representatives of organizations that plan to implement
their own systems. Figure 9-4 presents a tabulation of the
affiliations of attendees at the four Workshops. Those in the
industrial area constitute more than 40% of the total
participants.

9.2 Seminars 

We have also conducted no-fee seminars in addition to the more
highly structured Workshops. Seminars are two to four hours
in length and consist primarily of the lecture portion of
the Workshop material. Seminars are usually held for an in-
dividual company or university and are conducted either
at IITRI or on-site at the organization, depending upon its
wishes. A large number of such seminars have been held and
are listed in Section 10 of this report.

A general type of seminar on the CSC has also been pre-
sented as a case study within the framework of workshops con-
ducted by the National Federation of Science Abstracting and
Indexing Services (NFSAIS). We have presented this case
study at NFSAIS workshops in Cleveland, Chicago, and New York,
and the next is planned for Houston in mid-October of 1972.

9.3 University Courses 

One of the more significant educational efforts has
been carried out in cooperation with Illinois Institute of
Technology, the university with which IITRI is affiliated.
During the 1969, 1970, 1971, and 1972 spring semesters a new
course was offered at IIT, "Modern Techniques in Chemical
Information." The course was made available to second year
graduate and upper division undergraduate students in the
Chemistry Department. This course replaced the traditional
chemical literature course and the chemistry graduate students
were given the option of taking the Modern Techniques course
in lieu of a second foreign language. One hundred percent of
the graduate students opted for the course. Members of the
IIT staff who serve on graduate advisory committees willingly
accepted this change as a significant improvement in the formal
training for the Ph.D. degree. One of the reasons for enthusiastic
acceptance of the course is that it presents a solid basis for
the understanding and use of chemical information systems in
the context of a 2-credit hour one-semester course.

The course was made available through a subcontract from
the IITRI Computer Search Center program to the Chemistry
Department at IIT and the course was taught by Dr. Paul E.
Fanta of IIT and Miss Martha E. Williams of IITRI.

The course covered techniques of storage, search and 
retrieval of chemical information. Specifically, it stressed 
the fact that chemical information exists in many different 
forms, both printed and machine-readable, and if the chemist 
is to make good use of the multiplicity of available data bases 
and collections, he must expand his horizons and be prepared 
to use the computerized files as well as the traditional 
collections. Information resources and methods of retrieval 
were considered from the viewpoint of information systems and 
the general problem was considered to be the retrieval of spe- 
cific data from a data store. 

Inasmuch as none of the available chemical literature
textbooks provide adequate coverage of the modern techniques
and sources of chemical information, staff members from both
IITRI and IIT (Mr. Eugene S. Schwartz and Miss Martha E.
Williams of IITRI and Dr. Paul Fanta of IIT) developed a
syllabus and workbook for the course, "Modern Techniques."
The objectives and contents of the book are described in the
following section.

In addition to acquainting the student with the traditional
and modern methods of handling information, each of the students
participated in an SDI program. Instruction in profile pre-
paration was provided both through lectures and through study
of the Search Manual. Students became acquainted with the
problems and techniques associated with development of interest
profiles including selection of terms, truncation of term frag-
ments, development of expressions for proper logical association
of terms, use of links for grouping terms within an expression,
and assignment of weights.
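As a purely illustrative sketch of how truncated terms, weights, and a threshold weight can work together in an interest profile, the following example scores a citation title against a small hypothetical profile. The "*" truncation marker, the term list, the weights, and the scoring rule (sum of the weights of matched terms compared with a threshold) are all assumptions made for illustration; this section does not describe the actual CSC search logic, and links and logic expressions are omitted.

```python
import re


def term_matches(term, word):
    """Right truncation (assumed notation): 'POLYM*' matches words starting with 'POLYM'."""
    if term.endswith("*"):
        return word.startswith(term[:-1])
    return word == term


def profile_score(weighted_terms, title):
    """Sum the weights of all profile terms found among the words of the title."""
    words = re.findall(r"[A-Z]+", title.upper())
    return sum(
        weight
        for term, weight in weighted_terms.items()
        if any(term_matches(term, word) for word in words)
    )


def is_hit(weighted_terms, threshold, title):
    """A citation is reported when its score reaches the threshold weight."""
    return profile_score(weighted_terms, title) >= threshold


# Hypothetical profile: both concepts must be present to reach the threshold.
profile = {"POLYM*": 5, "CATALY*": 5}
print(is_hit(profile, 10, "Catalytic polymerization of olefins"))
print(is_hit(profile, 10, "Polymer degradation in sunlight"))
```

With equal weights and a threshold equal to their sum, the weighted profile behaves like a logical AND of the two truncated terms; lowering the threshold to 5 would turn it into a logical OR.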

The machine-readable data base used for the student SDI
experiment was Chemical Abstracts Condensates. In the first 
year, students conducted manual searches of an issue of Chemical 
Abstracts in two subject areas, one organic and the other 
inorganic. In the second and third years students conducted 
manual searches of two issues of Chemical Abstracts. After 
completing the manual searches, they prepared interest profiles 
which were used by IITRI in a search of the corresponding issues 
of the Condensates tapes. Output from the SDI run was returned 
to the students for comparison with output from their manual 
searches. 

In many cases extremely good profiles were prepared with
good relevance ratings. In other cases profiles were defective
for several reasons. In all cases, after the students completed
the assigned evaluation and comparison of their manual versus
machine searches, they understood and were able to explain why
their profiles were effective or ineffective. The time saved
by the computer search was dramatic and impressed students who
had had to spend considerable time in conducting the manual
searches.

From the viewpoint of both instructors and students, the
course accomplished its major objective, i.e., it provided a
survey of traditional techniques of chemical literature and
showed the relationship of those techniques to modern search
methods.

Another objective was to make the students sufficiently 
aware of the capabilities of computer services so that when 
they enter the industrial community, they will request such 
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services. These students will be the future chemists and users 
of computerized chemical information systems. Hopefully, in 
much the same way that students who use modern analytical 
equipment in their university laboratories demand modern equip- 
ment in the industrial laboratories that hire them, so students 
familiar with automated information handling will require these 
services from their employers. 

Miss Williams is currently discussing preparation of 
short courses and/or audio cassette courses based on the 
"Modern Techniques" course with the American Chemical Society 
and others. 

9.4 Workbook for Modern Techniques in Chemical Information

The absence of any textbook providing adequate cover- 
age of the modern techniques for search and retrieval of chemical 
information, and of the newer--principally machine-readable-- 
sources of chemical information prompted IITRI's development of 
a workbook entitled Modern Techniques in Chemical Information . 

The book was designed for use by chemists and does not 
require a background in computer technology, programming, or
information science. It exposes the student to the potentials 
and limitations of information systems and sources and explains 
the storage, search, retrieval, and dissemination functions 
that characterize information systems. 

The chapters or principal topics are: (1) "Information
Systems," (2) "Indexing and Classification," (3) "Primary
Information Sources in Literature," (4) "Patents," (5) "Second-
ary Information Sources in Literature-1: Abstracting Periodi-
cals, Review Serials," (6) "Secondary Information Sources in
Literature-2: Reference Works," (7) "Chemical Information
Centers" including the computer searchable data bases and
computer centers, (8) "Chemical Structures in Literature and
Machine," (9) "Search Systems" including an introduction to
computer components, programming languages, programming,
and computer systems, and (10) "Information Retrieval in a Current
Awareness System."



The workbook was tested via the IIT course in 1969, 1970,
1971, and 1972. A proposal for development of a textbook based
on the workbook has been submitted to NSF. After review and re-
vision have been completed it will be published and distributed.

9.5 Technical Presentations, Publications, and Mass Mailings 

The final method for educating potential users has
consisted of presentations at technical meetings, pre-
paration of technical publications, and mass mailing of brief
descriptions of the CSC. A listing of presentations and pub-
lications is given in Section 12 and the mass mailings are
discussed in Section 10.5.1.

9.6 User Liaison 

In a system that was designed to be user-oriented, frequent
communication with users through various channels is extremely
important. In order to maintain good rapport with users and to
be sure that their profiles are functioning efficiently to
provide the desired information, CSC uses many avenues of
communication with users. Among them are:

• unlimited profile changes

• low-cost profile switch

• evaluation reports

• feedback cards

• continuous precision calculations

• telephone contact

• comments on profiles to suggest changes
in logic, weighting, and grouping of
terms, or to suggest use of new data
elements or new terminology

• site visits

The concern for users is of extreme importance to infor-
mation centers. Information systems are designed to be used
and if the clients are not satisfied with the service, they
will not use it.
10. CENTER MANAGEMENT AND PROCEDURES



10.1 CSC Profile Handling Procedures 

10.1.1 Receive Search Request 

Search requests are received by telephone, mail, or
site visits. These requests may be made either directly by the
researcher or indirectly through his representative. It is best,
where possible, to discuss the search subject directly with the
researcher.

10.1.1.1 Review and Interpret Search Question 

The user's statement of his question is read and carefully
studied. If the meaning of the question is not completely clear,
the user is called to discuss his information needs. He is asked
to identify pertinent search terms and synonyms, titles of per-
tinent papers, key authors and/or journals, etc. When the questions
are received via telephone, full details are written during the
call and, if possible, the requestor is asked for written confirmation.
Alternatively, he is sent a letter with the CSC interpretation of the
question and/or a copy of the proposed profile for his review and comment.

10.1.1.2 Conduct Manual Search 

In order to get a feel for a specific research area and to 
determine how this material is handled in a specific data base, 
a manual search is carried out. This manual search is conducted 
in the appropriate hard copy counterpart of the data base against 
which the question will run to determine useful search terms and 
strategy for the profile. Hard copy indexes are checked to identify 
additional related terms. Dictionaries, encyclopedias, etc. also 
assist in further defining the question and in identifying candidate 
terms for the profile. All worksheets prepared in development of 
the search strategy are kept in the profile folders. 
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10.1.2 Profile Handling Procedures--New Profiles 

The following are the steps required in preparing a new 
profile for CSC: 



• Review the subject of the search question and select
the appropriate data base(s) against which the pro-
file is to be run.

• Select the appropriate profile form(s). These are
entitled "Computer Search Center--Search Profile-
Header" (form P1), "Computer Search Center--Search
Profile Terms" (form P2), and "Computer Search Center--
Search Profile - Logic" (form P3). (See Figures 10-1, 10-2
and 10-3.) The forms for CA are reproduced on white
paper; for EI, on yellow; and for BA, on green.

• Select candidate search terms. The selection of ap-
propriate terms can only be done after gaining a
good understanding of the search question in relation
to the user's needs and in relation to the specific
data base(s) against which the profile will be run.

• Check these candidate search terms for correct trunca-
tion and frequency of usage using the Truncation
Guide, term lists, and/or the KLIC Index for the
appropriate data base(s).

• Prepare the profile form using the profile check list.
(See Figure 10-4.)

• Assign a profile number identifying the organization
(corporation) from the organization code book. Profiles
for a specific organization are numbered consecutively
based on order of arrival. See Section 10.1.4 for
details of CSC profile number designations.

• Prepare a User Record sheet for each new profile.
(See Figure 10-5.) If this question comes from a new
organization, a Corporate-User Record sheet must be
prepared. (See Figure 10-6.)

• Prepare file cards, folders, etc., for each new profile.
(A description of CSC files follows in this report.)

• Xerox a copy of the completed profile. This copy is
sent to the user with his first run.

• Prepare a cover letter to be sent with the output
from the first run. There is a "standard" cover letter.
(See Figure 10-7.) Special comments relating to a
specific profile are added to this basic letter.

• The completed, checked profile is then ready to be key-
punched.

• Record the profile number on the Profile Deck Modification
sheet along with any appropriate comments, e.g., odd
only, new profile, etc. This sheet lists all new, mod-
ified, or dropped profiles for each run. (See Figure 10-8.)
Check each of the following as considered

Profile form correct for data base to be searched, CA - white,
BA - green.

Profile number coded correctly according to data base and user-type.

Profile number recorded on each page.

User name and address correct and complete.

User phone number recorded and complete.

User name recorded on each page of profile.

Each page of profile form numbered.

Statement of search question as detailed as possible. All available
information recorded.

CA search coverage (even, odd, both) recorded.

# of terms recorded corresponds to # of terms listed.

# of links recorded corresponds to # of links listed.

Output limit recorded.

Threshold weight recorded. Does this correctly represent the question?

If weights assigned, do all terms have weight recorded? Do the "not"
terms have zero weight?

Medium and Sort recorded.

Letters, "I" and and numbers "0", "1" and "2" correctly written.

Term-types are correctly recorded.

Terms correctly spelled.

Term-truncation and frequencies checked.

Truncation modes are correctly recorded.

Terms are correctly numbered.

Each link contains 2 or more terms.

All terms and links are accounted for in the logic.

Logic statement correctly expresses search question.

Logic statement is clearly and correctly printed with brackets in place.

Modifications are recorded, dated and initialled.

5/4/72 - PAL
Reference:

Dear

Enclosed is the output for the first run of your profile(s)
and a xerox copy of the profile(s) which was written for your
search question.

With each issue's run, you will receive an evaluation form.
We would appreciate your filling in the two blanks (indicating
the number of citations that were of interest and the number
that were not) and returning the evaluation form to us. Also,
the final card in your output lists the reference number for
each citation included in your printout. Please circle the
reference number of each citation which was of interest to you
and return this card to us. These two forms are used in helping
to modify your profile and in collecting general statistics on
the runs. The data obtained from the forms is in no way con-
nected with your company name or the subject of the profile.

We are happy to discuss your profile at any time. If you
have any questions or comments, please do not hesitate to call.
Questions regarding Chemical Abstracts or Biological Abstracts
profiles should be directed to Patricia Llewellen (x5031) or
Margaret Scheibe (x5028). Questions regarding Engineering Index
profiles should be directed to Alan Stewart (x5364).
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10.1.3 Profile Handling Procedures--Modified Profiles 

A copy of the current version of the profile must be main-
tained in the file. Xerox this profile (N.B., return the orig-
inal to the file) and use the copy as a working copy during
modification.

• Attach a complete new form P1 to the profile form.
For CA this is white; for EI, yellow; and for BA,
green.

• Make all necessary changes on the Xerox copy of the
profile, date and initial all changes on form P1,
and record the reasons for the change(s). Where
changes are extensive, prepare a complete new set
of profile forms.

• Assign a modification number. This involves a
change in the tenth character of the profile number,
e.g., A -> B -> C -> D, etc.

• A Xeroxed copy of this modified profile is made.
Send this to the user with the first output from
this modification.

• Make necessary changes in the profile records, i.e.,
change user name, output limit, output frequency,
etc.
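The modification step that advances the tenth character of the profile number can be sketched as follows. This is an illustration under stated assumptions: the profile number in the example is hypothetical, the function name is invented here, and the report does not say what happens after the letter Z, so the sketch simply refuses to go further.

```python
import string


def next_modification(profile_number):
    """Return the profile number with its tenth character advanced by one letter.

    Hyphens, if present, are kept; only the final (modification) letter changes.
    """
    compact = profile_number.replace("-", "")
    if len(compact) != 10:
        raise ValueError("a CSC profile number has ten characters")
    current = compact[-1]
    if current not in string.ascii_uppercase:
        raise ValueError("the tenth character must be a letter")
    if current == "Z":
        # Assumption: behavior after Z is not stated in the report.
        raise ValueError("no modification letter after Z")
    return profile_number[:-1] + chr(ord(current) + 1)


# Hypothetical profile number in the form AN-ANN-NNN-NA.
print(next_modification("C1-L01-001-1A"))
```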

10.1.4 Designation of CSC Profile Numbers 

10.1.4.1 CSC Profile Number 

A CSC profile number consists of ten alphanumeric characters
in the following form: AN-ANN-NNN-NA (A indicates a letter; N,
a number).

Character 1          indicates the data base, e.g., B
                     indicates BA, C indicates CA, and
                     E indicates EI.

Character 2          indicates odd (1), even (2), or
                     both (3) issues of CA. (1) used
                     for BA & EI.

Character 3          indicates user-type classification.

Characters 4-5       indicate the corporate number (the
                     user).

Characters 6-7-8     indicate the user within a corpora-
                     tion.

Character 9          indicates the profile (of a user).

Character 10         indicates the modification version.
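The ten-character layout above lends itself to a simple validation and decoding routine. The following sketch is illustrative only: the function name, the dictionary keys, and the example profile number are hypothetical, and treating the hyphens as optional is an assumption.

```python
import re

# AN-ANN-NNN-NA, with the hyphens treated as optional (assumption).
PROFILE_RE = re.compile(
    r"^([A-Z])([0-9])-?([A-Z])([0-9]{2})-?([0-9]{3})-?([0-9])([A-Z])$"
)

DATA_BASES = {"B": "BA", "C": "CA", "E": "EI"}
ISSUES = {"1": "odd", "2": "even", "3": "both"}


def parse_profile_number(text):
    """Split a CSC profile number into the fields described in Section 10.1.4.1."""
    match = PROFILE_RE.match(text.strip().upper())
    if match is None:
        raise ValueError("not in the form AN-ANN-NNN-NA: %r" % text)
    data_base, issues, user_type, corporate, user, profile, version = match.groups()
    return {
        "data_base": DATA_BASES.get(data_base, data_base),
        "issues": ISSUES.get(issues, issues),
        "user_type": user_type,
        "corporate_number": corporate,
        "user_number": user,
        "profile": profile,
        "modification": version,
    }


# Hypothetical example: a CA profile searched against both odd and even issues.
print(parse_profile_number("C3-L01-002-1A"))
```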
PROFILE DECK MODIFICATION
SHEET



10.1.4.2 User-Type Classification

General classifications that indicate type of user are as
follows:

A-F       Academic

G-K       Independent Research Organization

L-R       Industrial

S         Workshop

W&Y       Government

X         Experimental and Standard Profiles
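The user-type letter (character 3 of the profile number) can be translated into its classification with a small lookup. This is a sketch with a hypothetical function name; letters not listed in the classification (T, U, V, and Z) are returned as None, which is an assumption about how unassigned letters might be treated.

```python
USER_TYPE_RANGES = [
    ("A", "F", "Academic"),
    ("G", "K", "Independent Research Organization"),
    ("L", "R", "Industrial"),
    ("S", "S", "Workshop"),
    ("X", "X", "Experimental and Standard Profiles"),
]
GOVERNMENT_LETTERS = {"W", "Y"}


def user_type(letter):
    """Return the user-type classification for character 3 of a profile number."""
    letter = letter.upper()
    if letter in GOVERNMENT_LETTERS:
        return "Government"
    for first, last, name in USER_TYPE_RANGES:
        if first <= letter <= last:
            return name
    return None  # letter not assigned in the classification


for code in ("C", "M", "W", "T"):
    print(code, user_type(code))
```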



10.1.5 Keypunch Profile 



All profiles--new or modified--are entered into the system
within five working days of receipt by CSC. Keypunching is scheduled
to meet this objective. All keypunching is proofread twice, first
by the keypuncher and second by someone other than the keypuncher.
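The five-working-day target can be turned into a calendar date with a short calculation. The sketch below assumes that working days are Monday through Friday and that holidays are ignored; the function name is hypothetical.

```python
from datetime import date, timedelta


def keypunch_deadline(received, working_days=5):
    """Return the date that falls `working_days` working days after receipt.

    Assumptions: Saturdays and Sundays are skipped; holidays are not considered.
    """
    day = received
    remaining = working_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return day


# A profile received on Monday, May 1, 1972.
print(keypunch_deadline(date(1972, 5, 1)))
```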

10.1.6 Enter Profile in Input Data Deck

The profile keypunch cards and the Profile Deck Modifications
sheet are correlated. These records are later checked against
DKEDIT and MINIPUP to assure all new or modified profiles are
accounted for.

10.1.7 Check Output and Prepare Mailing

Before the output is packaged, output for new profiles and
revised profiles must be carefully checked. The retrieved citations
are reviewed for technical value to the search question. Consider-
ation is given to the value of material in the data base, non-per-
tinent citations due to faulty logic, misinterpretation of a con-
cept, etc. Search terms are rechecked for spelling, truncation,
term type designations, and search logic in relation to the par-
ticular citations retrieved. If a serious error or problem has
occurred, the user is called to discuss this problem and to discuss
the procedures required to correct it before the next run. With
each new profile, the cover letter (see Figure 10-7) is sent to the
user requesting his return of the evaluation report sheet (see Figure
8-1) and trailer card (see Figure 10-9) and explaining how they are
filled out. A copy of the profile is sent with the first run
from each new or modified profile.

After the output is received and reviewed it is packaged and
labeled for delivery.

10.1.8 Monitoring Profiles 

10.1.8.1 Monitoring New Profiles 

All new profiles are carefully monitored for the first four
or five runs. User evaluation forms are checked for specific
comments on the output as well as for identification of pertinent
and non-pertinent citations. If these evaluations are not re-
turned promptly the user should be called to discuss the output.
Typical questions to ask him are: Has the output been relevant?
Have there been specific problem areas in all the output sent so
far (i.e., have all non-relevant references come from a given
section of CA)? Does the user know of "missed" citations? Dis-
cuss the search question again. Have the search results pointed
out areas of interest or non-interest that the user had not con-
sidered before? Determine what changes are necessary to improve the
usefulness of the output. Review user requests to add, delete or
change terms and/or logic carefully. If there appear to be problems
implementing the request, call the user to discuss his new information
needs.



10.1.8.2 Monitoring Existing Profiles 

After a profile has been stabilized, output is checked every 
four to six runs to be sure no error or change in data base format 
is affecting the search results. The user is called every two to 
three months to 1) check on the performance of the profile, 2) apprise 
him of any new searchable data elements, and 3) determine if his 
information needs are changing due to changes in his research in- 
terests. If the profile needs modifications, these should be dis- 
cussed and implemented. User requests for changes should be care- 
fully reviewed to determine their effect on the profile output. 
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10.1.9 Dropping of Profiles 



A user may request that a specific profile be dropped. The 
appropriate file changes and deletions must be made. 

These include: 

• Prepare two DROP cards--one for the term and one for
the logic section of the deck. A DROP card has the
profile number in columns 1-10 and the word DROP in
columns 11-14.

• Complete a CSC Profile DROP Checklist for each Corporate
User Record sheet. (See Figure 10-10.)

• Complete the User Record sheet. (See Figure 10-5.)

• File Profile folder in Dropped file drawer.

• Record "DROP" status on User subscription card, on
Profile Hit Record Sheet(s), and in Profile History
Book.
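The DROP card layout described in the first step above (profile number in columns 1-10, the word DROP in columns 11-14) can be sketched as a fixed-width card image. The 80-column card width, the omission of hyphens from the punched profile number, and the function name are assumptions made for this illustration; the profile number shown is hypothetical.

```python
def drop_card(profile_number, width=80):
    """Build the card image for a DROP card.

    Columns 1-10 hold the profile number and columns 11-14 the word DROP;
    the remaining columns are left blank. Two such cards are prepared, one
    for the term section and one for the logic section of the deck.
    """
    compact = profile_number.replace("-", "").upper()
    if len(compact) != 10:
        raise ValueError("a CSC profile number has ten characters")
    return (compact + "DROP").ljust(width)


card = drop_card("C1-L01-001-1A")
print(repr(card[:14]), len(card))
```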

10.2 Center Files and User Records

10.2.1 User Record File 

Individual user records are maintained on 8-1/2" x 11" sheets.
Each sheet includes the user's name, company affiliation, mailing
address, telephone number, and individual profile number. Profile
modification and status are recorded. These sheets are filed alpha-
numerically by corporate and user code.

10.2.2 Corporate-User Record File 

This file is made up of 8-1/2" x 11" sheets which identify company
name, address, corporate code, company contact(s), and telephone
number(s). They are arranged alphanumerically by corporate code in
a Corporate-User Record File book. The Corporate-User Record sheet
also indicates the name of each user within the company and the number
of profiles he has running in the system. (See Figure 10-6.)

10.2.3 CSC User Profile History Book (Restricted Data)

This book contains detailed data on profiles for each CSC user. 
Information in this book includes individual user name, corporate 
name and telephone number, status of profile, and dates. This in- 
formation is arranged alphabetically by corporate user. This mater- 
ial is updated daily. 
CSC Profile DROP Checklist

Profile number(s)
Corporate Code
Date
Initial

Reason(s) for dropping:
    end of free-trial, not purchased
    end of subscription, not repurchased

Give specific reason(s) for termination of services:

Procedural check:
    "Drop" status recorded on Corporate User Sheet
    User Record Sheet completed, pulled and refiled in profile folder
    Profile folder refiled in appropriate "dropped" file drawer
    User subscription cards pulled and refiled
    "Drop" status and date recorded on Profile Hit Record Sheet(s)
    "Drop" cards keypunched and submitted to appropriate deck(s)
    "Drop" status and date recorded in Profile History Book

Figure 10-10
DROP CHECKLIST

CSC
6/1/72 PAL
10.2.4 CSC Profile Folder Files

Active Profile Folder files are arranged alphanumerically
by profile number for each data base. There is a folder in the
file for each active profile number. This folder contains a copy
of each profile modification. A master folder of company
correspondence and contact information is filed immediately in front of
each company's set of individual user profiles.

Inactive (dropped) Profile Folder files are filed
alphanumerically by profile number.

10.2.5 Profile Correspondence File

Correspondence related to specific profile activities, e.g.,
term additions, deletions, modifications, etc., is filed in a folder
directly in front of each company's set of profiles.

10.2.6 Telephone Number File 

The telephone number file is maintained on 4" x 6" cards.
Phone numbers are referenced in two ways: one half of the file is
arranged by company name, the other half by user name. This file
gives easy access to company and individual user telephone numbers
and allows rapid identification of the corporate code. Many user
contacts (by letter or telephone) are made without identifying
profile numbers; profile file locations can be identified rapidly
using this file. (See Figure 10-11.)

10.2.7 Profile Hit Evaluation File

This file is made up of 11" x 17" fold-out sheets on which are 
recorded weekly (CA) profile hit statistics. The number of hits 
received per search and user evaluation results are recorded for 
each profile. These sheets are arranged alphanumerically according 
to code. (See Figure 10-12.) 

10.2.8 Evaluation Report File 

On these 8-1/2" x 11" sheets are recorded profile number, user name,
number of citations (hits) sent, and the user's evaluation and
comments concerning these citations. They are filed
alphanumerically by profile number in groups according to data base
volume. These reports are reviewed weekly to assist in maintenance
and modification of each user's profiles.

Statistics relating to profile hit relevance are prepared
from these sheets. Evaluation data from these sheets are recorded
in the Profile Hit Evaluation Record file. (See Figure 10-12.)

Chemistry Research Division
IIT Research Institute
10 West 35th St.
Chicago, Ill. 60616

X5031

Figure 10-11
TELEPHONE NUMBER

10.2.9 Abstract Number Card File

These cards are the trailer cards from each CSC profile out-
put. Recorded on each card is the user evaluation of the hits for
his profile. These evaluations are reviewed to assist in profile
maintenance and updating procedures. These 5" x 8" cards are
filed by profile number in data base volume groups.

10.2.10 Profile Deck Modifications File

These 8" x 11" sheets are completed weekly as new profiles or
profile modifications are generated. They accompany the profile
keypunch cards and are checked at time of DKEDIT to assure that all new,
modified, or dropped profiles are in the system. These sheets are
filed according to data base by volume and issue number. (See Figure 10-8.)

10.2.11 CSC Profile DROP Checklist 

These 8-1/2" x 11" sheets are prepared for each profile to be
dropped from the system. After the various "drop" procedures are 
completed, this sheet is filed in a folder in corporate code order 
in the drawer with the dropped profiles. 

10.2.12 CSC Billing Forms 

Billing Forms are prepared for each profile. They are maintained
in alphabetic order by company and are cross-referenced to the
billing number. (See Figure 10-13.)

[Billing form of the Computer Search Center, IIT Research Institute.
Legible fields: Billing Request #; Page _ of _; Company; Address;
Project #; Date; Attention; Purchase Order; Description of Service.
Under CURRENT AWARENESS: Subscription to; Profile #; Time Period
(mo./yr. to mo./yr.); New Subscription; Supplemental Output Units @ $;
Supplemental Term Units @ $; Total Fee for Profile; Previous
Subscription (for the period mo./yr. to mo./yr.); Total Additional
Charges. Under RETROSPECTIVE SEARCH: Time Period; Volume(s); Terms;
Total Fee for Search. A note on the form states that extra output
cards are charged at $.05 per card, to be billed at the end of the
subscription period.]

Figure 10-13
BILLING FORM

10.2.13 Subscription Card File

This 5" x 8" card file, organized alphabetically by company
name, gives subscription and billing information including
coverage, starting and terminating dates, etc. (See Figure 10-14.)

10.2.14 Profile Subject Index

Profile subject categories are entered on 5" x 8" white
cards along with their related profile number(s). Term cross- 
indexing has been done for major subject categories. This file, 
arranged alphabetically by subject, is used by the CSC staff 
in profile preparation to locate profile questions with similar 
subject coverage. (See Figure 10-15.) 

10.2.15 Form Masters and Supplies 

Master (reproducible) copies of all CSC forms are
maintained in the CSC office. Supplies of these forms are
also kept in this office. When the supply of a form runs low,
the form is reviewed for possible updating or other revisions.
A modified form is prepared when necessary, copies are made,
and the "new" master is set aside in the file.
Figure 10-14

10.3 Tape Quality and Handling

10.3.1 Tape Quality--Physical Characteristics

The major difficulty in this respect is physical
damage to the tapes, which makes it impossible to read the
tape into core. Apart from obvious physical damage, such as
being run over by a truck (this has happened--tire marks were
visible), we have received tapes that were dirty and/or written
with a tape drive that had mechanical defects which caused mis-
alignment of bits (skew errors). Dirty tapes (a thin film is
sufficient to damage tape) can sometimes be rescued by operator
intervention and cleaning on the tape drive. This is a poor
solution, since it causes the operator to interrupt processing
of all jobs. About six of the 52 CA tapes issued in 1971 were
damaged in this way. Three were corrected by cleaning, and
three were replaced. We had two such tapes from EI and none
from BA. In 1972 we have received two damaged tapes from EI.

10.3.2 Tape Quality--Readable, Mis-recorded Information

The second area in which we have experienced problems is
that of wrong information on tapes. This includes machine-
readable labels that do not correspond to paper labels. At
present, data base paper labels, as they are sent from suppliers,
are inadequate. As a minimum, a label should denote:

o tracks
o recording density
o reel number
o number of files on tape
o record and block size
o supplier name and address
o creation date and job number under which created
o dataset name of each file
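The minimum label contents listed above lend themselves to a simple completeness check. The sketch below is illustrative only: the field names and the dictionary representation of a label are assumptions made for this example, not part of any CSC program.

```python
# Field names are hypothetical; they mirror the eight items a label should denote.
REQUIRED_LABEL_FIELDS = (
    "tracks",
    "recording_density",
    "reel_number",
    "number_of_files",
    "record_and_block_size",
    "supplier_name_and_address",
    "creation_date_and_job_number",
    "dataset_names",
)


def missing_label_fields(label: dict) -> list:
    """Return the required fields that are absent or empty on a paper label."""
    return [name for name in REQUIRED_LABEL_FIELDS if not label.get(name)]


if __name__ == "__main__":
    # A made-up, incomplete label of the kind the text calls inadequate.
    label = {"tracks": 9, "recording_density": "800 bpi"}
    print("missing:", missing_label_fields(label))
```

A label that passes this check is complete, but it may still disagree with the machine-readable label on the tape, which is the other problem described in this section.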

The other errors of this type refer to data that are coded in-
correctly, for example, directory entries that are wrong. CSC
programs for conversion skip non-acceptable data, and if enough
portions of a citation are garbled, the entire citation
is skipped. In 1971 CA Condensates had about one bad record
per 7000. It is now about 1 in 20,000. Since EI and BA tapes
are constructed differently, we process the errors as given,
and so cannot tell how many have occurred for EI.

10.3.3 Tape Quality--Wrong Information

The final category includes misspellings and similar
errors for which we do not make checks. However, from such
things as KLIC indexes, we do know that these errors occur.
For example, of the 351 EI Card-a-Lert codes found on the 1971
tapes, 131 were spurious. Some 10% of the words in the alpha-
betical term list for EI are misspelled. BA and CA have less
than 1% of this type of error, based on observation, not actual
count.

10.3.4 Tape Handling Procedures 

We have established the following procedures for detecting
these conditions and converting data bases to IITRI format. When
tapes are received, they are logged in and sent to the data center.
Here the paper label is checked, the appropriate format conver-
sion program is chosen, and the JCL is prepared. The program
is then run. If the conversion program runs properly, the
output is checked for "bad" records. If these are few, the
converted tape is used for the production run and the original
is copied for backup storage. If the number is greater than
fifty, we obtain a new tape from the supplier, returning the
original and as much information as possible to inform the
supplier of the errors.
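The acceptance rule for a tape that converts successfully can be written as a small decision function. This is a sketch under one assumption: the text says only that "few" bad records are acceptable and that more than fifty are not, so counts of fifty or fewer are treated here as "few". The function and constant names are hypothetical.

```python
BAD_RECORD_LIMIT = 50  # from the text: more than fifty bad records means replacement


def tape_disposition(bad_records: int) -> str:
    """Decide what to do with a tape whose conversion ran properly."""
    if bad_records < 0:
        raise ValueError("bad record count cannot be negative")
    if bad_records > BAD_RECORD_LIMIT:
        return "replace"  # return original to supplier with error details
    # Assumption: fifty or fewer counts as "few".
    return "accept"       # use converted tape; copy original for backup


if __name__ == "__main__":
    for count in (0, 50, 51):
        print(count, tape_disposition(count))
```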

If the format conversion program does not run properly,
there are four possible causes.

(1) If the wrong JCL is employed, the entire tape
will be unreadable. In this case we correct the
JCL and run again.

(2) If the machine-readable label on the tape
contains an error, the entire tape will be
unreadable. In this case we dump the label
and first few records. If the dump indicates
a blank tape, we obtain a replacement. If it
indicates a wrong density or mislabelling,
we change our JCL to conform to the actual data
and rerun the conversion program, and
we notify the supplier of the errors.

(3) If the tape contains dirt, oxide film, a crease,
crinkle, or other physical defect, the format
conversion program will fail during processing.
At times the tape is merely dirty and can be
cleaned; this will be attempted. If salvage
is not possible, we obtain a replacement
and provide the supplier with documentation as
to type and position of error.

(4) A change in data base format may cause the
format program to fail completely or to run
but produce an incorrect conversion. Deter-
mination of this kind of error requires a
dump and analysis of the incoming tape, and
the only remedy is to modify the conversion
program to take into account the format change.
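The four failure causes and their remedies can be summarized as a lookup. The sketch below is an illustrative restatement of the text, not a CSC program: all names are hypothetical, and the step of rerunning the conversion after a successful cleaning is an assumption, since the text does not spell it out.

```python
from enum import Enum


class Cause(Enum):
    WRONG_JCL = 1
    LABEL_ERROR = 2
    PHYSICAL_DEFECT = 3
    FORMAT_CHANGE = 4


def remedy(cause: Cause, dump_shows_blank: bool = False,
           cleaning_worked: bool = False) -> list:
    """Return the remedial steps for a failed format conversion run."""
    if cause is Cause.WRONG_JCL:
        return ["correct JCL", "rerun conversion"]
    if cause is Cause.LABEL_ERROR:
        # The label and first few records are dumped before deciding.
        if dump_shows_blank:
            return ["obtain replacement tape"]
        return ["change JCL to match actual data", "rerun conversion",
                "notify supplier of errors"]
    if cause is Cause.PHYSICAL_DEFECT:
        if cleaning_worked:
            # Assumption: conversion is rerun once the tape has been cleaned.
            return ["clean tape", "rerun conversion"]
        return ["obtain replacement tape",
                "document type and position of error for supplier"]
    if cause is Cause.FORMAT_CHANGE:
        return ["dump and analyze incoming tape", "modify conversion program"]
    raise ValueError("unknown cause")


if __name__ == "__main__":
    for cause in Cause:
        print(cause.name, "->", remedy(cause))
```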
10.4 Production Statistics and Cost

Detailed statistics have been collected on operations and
costs associated with the Computer Search Center. Direct search
costs are relatively easy to obtain inasmuch as these costs are
derived from computer processing. Production statistics and
costs for searches of CA, BA, and EI are given in this section.

10.4.1 Computer Time per Program 

The overall programming system is made up of five basic
programs (Section 5). CSC monitors the system by continuously
checking the amount of time (cost) and the relative percent
time each of the individual programs expends in carrying out
production runs. Data presented here have been normalized for
purposes of comparison. The percent of computer
time for the four program functions--Format Conversion (data
base input preparation), Input (profile input preparation),
Search, and Output (preparation of output)--is given in Tables
10-1 through 10-13. The fifth program, Statistics generation,
uses so little computer time that it was omitted from these
tabulations. Data showing the same relationships have been
graphed and are presented in Figures 10-16 through 10-24.

The percentage of computer time is a relative number--as 
the percentage of one program decreases the percentages of 
the other programs increase. However, absolute cost of the 
entire system decreases as the cost for any individual pro- 
gram decreases. Examination of the percentages helps us deter- 
mine which portions of the system (programs or modules) we 
want to work on to further cut costs. Table 10-1 gives the 
average percent run times and ranges of percent run time for 
processing CA Volume 76. 
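The relationship between run times and percentages can be checked with a short calculation. The sketch below uses the four normalized times from an issue 1 row of one of the per-issue tables in this section (13:35, 11:51, 2:20:07, 11:45) and then halves the Search time to show that the other programs' percentages rise even though the total falls. The function names are hypothetical, and the halved Search time is an invented figure used only for illustration.

```python
def to_seconds(clock: str) -> int:
    """Convert 'm:ss' or 'h:mm:ss' to seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def percent_shares(times: dict) -> dict:
    """Percent of total run time used by each program function."""
    secs = {name: to_seconds(t) for name, t in times.items()}
    total = sum(secs.values())
    return {name: round(100.0 * s / total, 2) for name, s in secs.items()}


# Normalized times for one issue, as printed in the per-issue tables.
ISSUE_1 = {"format": "13:35", "input": "11:51",
           "search": "2:20:07", "output": "11:45"}

# Hypothetical case: Search time cut in half, everything else unchanged.
HALVED_SEARCH = dict(ISSUE_1, search="1:10:03")

if __name__ == "__main__":
    print("as run:       ", percent_shares(ISSUE_1))
    print("search halved:", percent_shares(HALVED_SEARCH))
```

With the times as run, the computed shares agree with the percentages printed for that issue (7.66, 6.68, 79.03, and 6.63).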
Program Function            Program or         Average %     Range %
                            Module             Time          Time

Data base input             FORCON               10.00       9.0-14.0
  preparation
Profile input               DKEDIT-MINIPUP        1.56       1.0- 4.0
  preparation               INPUTR                2.05       2.0- 4.0
Search (term match and      SEARCH               75.06      68.0-75.0
  logic evaluation)
Output preparation          OCP                   6.38       5.0- 7.0
                            PRINT                 4.21       3.0- 5.0
Statistics generation       STIXA                  .17       0.1- 0.5
Private Libraries           PLSXT                  .57       0.5- 1.0
  extraction

Table 10-1
PERCENT AND RANGE PERCENT COMPUTER TIME PER PROGRAM

Table 10-2 displays for comparison the average percent
computer time per program for CA Volumes 71 and 76.

Program Function            Program or         Average %     Average %
                            Module             Time          Time
                                               Vol. 71       Vol. 76

Data base input             FORCON               16.71         10.00
  preparation
Profile input               DKEDIT-MINIPUP        1.43          3.61
  preparation               INPUTR
Search                      SEARCH               77.54         75.06
Output preparation          OCP-PRINT             4.32         11.16*
Statistics generation       STIXA                  -             .17

*includes private libraries extraction--0.57

Table 10-2
AVERAGE PERCENT COMPUTER TIME PER PROGRAM
CHEMICAL ABSTRACTS CONDENSATES VOLUME 71

       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

(Data for issue Nos. 1 - 8 do not exist)
    9       5:48    2.03    10:11    3.56  4:12:27   88.30    17:26    6.10
   10      11:44    2.56    13:17    2.91  6:21:13   83.50     2:4R   11.04
   11       4:55    1.06     9:22    2.02  6:46:28   87.64     2:23    9.28
   12       5:46    1.11     8:28    1.63  7:31:13   86.84    54:09   10.42
   13      14:00    3.07    21:08    4.58  6:38:22   85.69    30:44    6.66
   14      16:05    3.41     9:15    1.96  6:39:21   84.68    46:55    9.95
   15      18:43    7.91    18:31    7.82  3:11:21   80.84     8:07    3.43
   16      17:13    9.39     5:24    2.95  2:28:16   80.89    12:23    6.76
   17      20:43   10.13     9:38    4.71  2:22:54   69.85    31:19   15.31
   18       Test       -        -       -        -       -        -       -
   19      13:36    7.14     8:29    4.45  1:46:33   55.93  1:01:51   32.47
   20      17:28    8.51     6:19    3.08  2:22:27   69.42    38:58   18.99
   21      11:45    5.73     7:55    3.86  2:14:53   65.73    50:39   24.68
   22      15:11    5.63     6:19    2.34  2:59:23   66.51  1:08:50   25.52
   23       8:28    4.15    11:19    5.96  2:41:03   84.81    10:45    5.66
(Data for issue Nos. 24 - 26 do not exist)
  Odd      12:01    5.15    11:09    4.62  3:43:53   77.35    26:39   12.95
 Even      13:54    5.10     8:08    2.48  4:43:39   78.64    37:21   13.78
Total      12:49    5.13     9:52    3.70  4:19:26   77.90    31:14   13.31

Time* = Normalized time = Actual Cost of Operation / cpu charge

Table 10-3
PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1      13:35    7.66    11:51    6.68  2:20:07   79.03    11:45    6.63
    2      19:54   10.35    10:54    5.67  2:31:24   78.73    10:06    5.25
    3      13:39    7.65    12:46    7.15  2:13:38   74.86    18:27   10.34
    4      18:58    9.26    11:43    5.72  2:43:38   79.86    10:34    5.16
    5      13:44    8.62    12:27    7.82  2:00:22   75.51    12:40    7.95
    6      27:12    8.88    13:41    4.47  4:05:17   80.08    20:07    6.57
    7      14:40    7.85    13:06    7.01  2:22:34   76.28    16:34    8.86
    8      24:04    7.88    14:14    4.66  4:09:09   81.56    17:57    5.88
    9      18:15    7.49    12:28    5.12  3:09:31   77.80    23:22    9.59
   10      20:37    6.80    12:55    4.26  4:12:53   83.38    16:55    5.58
   11      17:26    7.14    13:19    5.45  3:12:23   78.78    21:04    8.63
   12      23:19    6.78    14:43    4.28  4:46:05   83.21    19:42    5.73
   13      10:25    4.58    14:03    6.18  3:02:13   80.13    20:43    9.11
   14      24:28    6.28    15:52    4.07  5:27:02   83.92    22:20    5.73
   15      16:18    6.56    14:23    5.79  3:16:19   79.03    21:25    8.62
   16      25:52    6.55     2:17   0.58°  5:45:58   87.63    20:41    5.24
   17      18:45    7.93     1:48    0.76  3:14:56   82.46    20:55    8.85
   18      21:27    8.10     1:50    0.69  3:49:19   86.57    12:17    4.64
   19      18:43    8.36     1:43    0.77  3:04:58   82.65    18:24    8.22
   20      24:35    9.20     2:29    0.93  3:55:51   84.12    15:22    5.75
   21      15:10    9.56     1:58    1.24  1:59:24   75.24    22:09   13.96
   22      18:26   14.67     1:55    1.52  1:32:48  73.82°    12:33    9.99
   23      23:05   21.44     2:17   2.12°  1:04:59   60.34    17:20   16.10
   24      27:38   20.94     2:15    1.71  1:39:07   75.09     2:59    2.26
   25      14:08   14.73     2:33    2.65  1:15:03   78.13     4:19    4.49
   26      29:49   21.89     2:10    1.59  1:41:02   74.18     3:11    2.34
  Odd      15:59    9.20     8:49    4.52  2:28:58   76.95    17:37    8.81
 Even      23:34   10.58     8:14    3.09  3:32:58   80.94    14:13    7.11
Total      19:46    9.89     8:31    3.80  3:00:58   78.94    15:55    7.96

Time* = Normalized time = Actual Cost of Operation / cpu charge
° = Major Modifications
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1       9:14   11.45     2:26    3.02  1:05:12   80.79     3:50    4.74
    2      28:27   17.79     2:49    1.76  2:04:19   77.75     4:19    2.70
    3      21:42   15.79     2:43    1.98  1:47:57   78.57     5:02    3.66
    4      15:40   11.35     2:29    1.80  1:55:49   83.93     4:02    2.92
    5      21:43   17.65     2:23    1.94  1:34:28   76.80     4:26    3.61
    6      22:31   21.57     2:12    2.10  1:16:37   73.38     3:05    2.95
    7      17:27   16.38     2:21    2.21  1:12:08   77.12     4:02    3.79
    8      25:49   25.38     2:09    2.11  1:09:58   68.79     3:47    3.72
    9      19:39   20.27     2:17    2.36  1:10:30   72.76     4:28    4.61
   10      24:50   23.32     2:08    2.00  1:15:43   71.09     3:49    3.59
   11      15:25   15.11     2:23    2.33  1:19:06   77.55     5:13    5.11
   12      19:56   19.04     2:18    2.19  1:18:33   75.02     3:56    3.75
   13      17:49   13.75     2:22    1.82  1:44:26   80.56     4:59    3.85
   14      22:36   21.64     2:02    1.94  1:15:44   72.54     4:03    3.88
   15      20:08   19.07     2:17    2.17  1:18:02   73.90     5:08    4.86
   16      23:19   20.14     2:10    1.87  1:25:06   73.49     5:13    4.50
   17      15:52   13.63     2:15    1.94  1:32:36   79.56     5:40    4.87
   18      25:28   22.46     2:01    1.77  1:20:19   70.82     5:37    4.95
   19      14:23   11.72     2:22    1.93  1:39:34   81.14     6:24    5.21
   20      26:25   21.69     2:09    1.76  1:27:00   71.43     6:14    5.12
   21      21:48   15.94     2:18    1.68  1:45:57   77.45     6:45    4.93
   22°     21:45   18.69     2:07    1.82  1:25:37   74.56     6:16    5.38
   23°°    17:29   13.43     2:25    1.86  1:49:59   79.09     6:42    5.15
   24      23:59   20.86     2:17    1.99  1:22:26   71.93     5:59    5.22
   25      15:29   12.17     2:25    1.90  1:42:04   80.24     7:14    5.69
   26      27:20   22.89     2:13    1.85  1:24:05   70.42     6:15    5.23
  Odd      17:33   15.10     2:23    2.13  1:30:55   78.12     5:23    4.62
 Even      23:42   20.52     2:14    1.92  1:26:15   73.40     4:49    4.15
Total      20:38   17.81     2:19    2.02  1:28:35   75.76     5:06    4.38

Time* = Normalized time = Actual Cost of Operation / cpu charge
°  FORCON includes CACOPY from this point on
°° Output includes PRLXT from this point on
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1      17:41   16.11     2:26    2.21  1:19:05   72.03    10:36    9.65
    2      29:04   24.53     2:14    1.88  1:20:24   67.85     6:48    5.74
    3      14:02   11.84     2:23    2.01  1:35:22   80.48     6:43    5.67
    4      27:21   21.81     2:18    1.83  1:29:31   71.38     6:15    4.98
    5      19:45   15.56     2:00    1.57  1:38:22   77.52     6:11    4.87
    6      24:23   23.49     2:20    2.25  1:11:49   69.18     5:16    5.08
    7      14:27   13.24     2:29    2.27  1:25:54   78.67     6:21    5.82
    8      34:51   22.78     2:44    1.79  1:47:33   70.29     7:52    5.14
    9      22:57   16.93     2:30    1.84  1:43:19   76.19     6:49    5.03
   10      34:08   23.51     1:40    1.15  1:41:44   70.06     7:40    5.28
   11      22:57   16.96     2:36    1.92  1:42:09   75.50     7:36    5.62
   12      31:59   21.58     2:26    1.64  1:45:52   71.43     7:56    5.35
   13      19:12   11.23     2:43    1.59  2:22:06   83.10     6:59    4.08
   14      43:15   23.10     2:41    1.43  2:11:10   70.70    10:07    5.40
   15      25:56   10.86     2:42    1.13  3:21:16   84.28     8:54    3.73
   16      34:40   19.36     2:42    1.51  2:13:49   74.72     7:54    4.41
   17      27:34   14.05     2:39    1.35  2:37:39   80.35     8:20    4.25
   18      28:06   17.67     2:42    1.70  2:00:53   76.03     7:19    4.60
   19      29:36   14.34     2:43    1.32  2:43:50   79.38    10:14    4.96
   20      28:16   16.59     2:46    1.62  2:11:38   77.25     7:43    4.53
   21      26:32   14.13     2:45    1.46  2:29:09   79.42     9:22    4.99
   22      33:54   17.52     2:48    1.45  2:27:22   76.16     9:25    4.87
   23      28:00   15.25     2:40    1.45  2:23:32   78.18     9:24    5.12
   24      43:54   19.03     2:53    1.25  2:53:51   75.36    10:04    4.36
   25      38:39   17.13     2:40    1.18  2:52:24   76.42    11:52    5.26
   26      45:26   20.63     3:02    1.38  2:40:59   73.11    10:43    4.87
  Odd      23:38   14.43     2:34    1.64  2:10:19   78.58     8:05    5.31
 Even      33:48   20.89     2:35    1.61  1:59:44   72.53     8:05    4.97
Total      28:43   17.66     2:35    1.63  2:04:58   75.56     8:05    5.14

Time* = Normalized time = Actual Cost of Operation / cpu charge
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1      12:53    7.40     2:45    1.58  2:27:47   84.93    10:36    6.09
    2      15:40    9.76     2:43    1.69  2:11:44   82.08    10:25    6.49
    3      11:36    6.59     2:03    1.16  2:31:10   85.84    11:16    6.40
    4      17:58   10.24     2:02    1.16  2:22:01   80.92    13:29    7.68
    5      14:10    8.06     2:30    1.42  2:27:20   83.81    11:48    6.71
    6      15:59    9.92     2:45    1.71  2:10:40   81.11    11:42    7.26
    7      10:42    8.02     2:35    1.93  1:51:13   83.31     9:00    6.74
    8      17:25    9.84     2:44    1.54  2:25:25   82.16    11:27    6.47
    9      12:02    7.44     2:37    1.62  2:16:31   84.43    10:31    6.50
   10      14:05    9.95     2:45    1.94  1:54:43   81.02    10:02    7.08
   11      11:46    7.25     2:32    1.56  2:16:55   84.36    11:04    6.82
   12      13:17   10.25     2:29    1.91  1:44:20   80.50     9:31    7.34
   13       9:40    9.68     2:22    2.37  1:19:04   79.14     7:45    7.76
   14       9:36   11.77     2:20    2.85  1:03:05   77.31     6:34    8.05
   15       9:42    9.31     3:23    3.25  1:22:46   79.51     8:15    7.93
   16      12:54   11.62     2:31    2.26  1:28:02   79.31     7:34    6.81
   17      12:10   10.96     2:29    2.23  1:27:35   78.90     8:47    7.91
   18      10:01   10.84     2:33    2.76  1:13:01   79.02     6:43    7.27
   19       6:30   10.82     2:17    3.81  0:44:00   73.34     7:13   12.03
   20      15:37   11.86     2:36    1.97  1:42:44   78.00    10:46    8.17
   21       8:00   10.84     2:11    2.96  0:57:01   77.26     6:36    8.94
   22      13:20   11.89     2:17    2.03  1:27:05   77.61     9:30    8.47
   23       8:59   10.85     2:24    2.89  1:04:12   77.54     7:12    8.70
   24      13:32   11.39     2:28    2.07  1:32:22   77.75    10:27    8.79
   25       9:45   11.13     2:17    2.61  1:07:01   76.50     8:33    9.76
   26      12:57   11.63     1:56    1.74  1:24:54   76.28    11:31   10.34
  Odd      10:37    9.10     2:30    2.26  1:40:58   80.68     9:07    7.87
 Even      14:01   10.84     2:28    1.97  1:30:47   79.47     9:59    7.71
Total      12:19    9.97     2:29    2.12  1:35:53   80.75     9:32    7.79

Time* = Normalized time = Actual Cost of Operation / cpu charge
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1       7:22   11.12     1:59    2.99    50:18   75.92     6:36    9.97
    2      10:15   13.88     2:01    2.73    54:21   73.64     7:11    9.73
    3       7:12    8.95     2:17    2.13  1:06:57   83.25     3:59    4.95
    4       9:35    9.00     2:21    2.21  1:24:20   79.17    10:15    9.61
    5       7:27    9.98     2:23    2.95    59:48   79.72     5:21    7.35
    6       7:54    8.99     3:16    3.71  1:08:23   77.81     8:20    9.48
    7       6:54    9.52     2:22    3.26    57:13   78.92     6:02    8.02
    8      12:49   10.29     2:31    2.02  1:39:43   80.17     9:20    7.51
    9       8:02    9.66     2:22    2.84  1:05:24   78.65     7:21    8.84
   10      13:15    9.88     2:17    1.70  1:44:22   77.87    14:08   10.54
   11       8:46    8.20     2:43    2.54  1:26:06   80.53     9:20    8.72
   12      13:02   10.22     2:14    1.75  1:41:12   79.39    11:00    8.63
   13       8:18    9.84     2:20    2.76  1:05:54   78.14     7:48    9.25
   14      12:34    9.57     2:19    1.76  1:43:46   79.10    12:32    9.55
   15       8:17    8.54     3:58    4.09  1:16:32   78.90     8:13    8.47
   16      12:32    9.04     3:27    2.52  1:47:58   78.76    13:17    9.68
   17       8:33    8.62     3:12    3.23  1:18:20   79.00     9:05    9.16
   18      12:00   11.98     2:33    2.54  1:13:07   73.00    12:30   12.48
   19       9:03   12.20     2:36    3.50    52:30   70.79    10:01   13.52
   20      12:04   11.96     2:35    2.56  1:10:38   70.01    15:36   15.43
   21       9:13   12.31     2:30    3.34    53:12   71.09     9:55   13.25
   22      12:33   12.80     2:14    2.28  1:10:41   72.06    12:37   12.86
   23       9:01   12.71     2:05    2.93    49:55   70.37     9:55   13.98
   24      12:29   13.16     2:28    2.59  1:07:23   70.52    13:13   13.83
   25      10:22   16.08     2:12    3.41    42:24   65.75     9:31   14.76
   26      13:54   13.41     2:28    2.39  1:11:42   69.18    15:35   15.03
  Odd       8:17   10.16     2:35    3.17  1:02:56   77.16     7:46    9.52
 Even      11:40   10.78     2:23    2.34  1:25:19   76.45    11:31   10.50
Total       9:59   10.37     2:29    2.58  1:14:08   77.02     9:39   10.03

Time* = Normalized time = Actual Cost of Operation / cpu charge
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1       8:15    6.95     1:04    0.91  1:47:07   90.17     3:09    2.65
    2       7:59    6.25     0:00    0.00  1:57:31   91.95     2:18    1.80
    3       7:47    7.77     0:55    0.92  1:29:29   89.31     2:00    2.00
    4       8:05    8.52     1:01    1.07  1:22:34   87.09     3:09    3.32
    5       8:13    7.58     1:02    0.96  1:35:39   88.32     3:31    3.24
    6       8:38    7.53     0:00    0.00  1:42:13   89.19     3:46    3.28
    7       7:48    4.53     0:00    0.00  2:37:10   91.27     7:13    4.19
    8       8:49    7.83     1:10    1.04  1:36:54   86.13     5:38    5.00
    9       8:12    6.83     1:07    0.93  1:44:38   87.20     6:04    5.05
   10       8:10    7.93     1:15    1.21  1:27:48   85.32     5:42    5.54
   11       7:51    6.27     1:04    0.85  1:51:26   89.07     4:47    3.82
   12       8:15    6.86     0:00    0.00  1:45:51   87.99     6:11    5.14
Total       8:10    7.07     0:43    0.66  1:44:52   88.58     4:27    3.75

Time* = Normalized time = Actual Cost of Operation / cpu charge
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

(Data for issue Nos. 1 - 6 do not exist)
    7       6:00   14.30     0:52    2.08    33:58   80.89     1:09    2.73
    8       6:04   15.54     0:00    0.00    31:52   81.69     2:11    3.61
    9       5:18   17.66     0:00    0.00    23:43   79.06     0:59    3.28
   10       5:19    7.52     0:57    1.35  1:03:10   89.22     1:21    1.91
   11       5:19    7.91     1:05    1.61    59:27   88.47     1:15    1.88
   12       5:26    8.32     0:47    1.21    57:40   88.17     1:30    2.30
   13       5:57    9.98     0:56    1.57    51:30   86.26     1:18    2.19
   14       5:25    9.39     1:50    3.19    49:58   86.76     1:15    2.18
   15       5:46    8.66     1:01    1.69    52:03   87.18     1:28    2.47
   16       5:21   10.38     0:59    1.89    43:32   84.37     1:44    3.36
   17       5:03    9.20     0:58    1.77    47:01   85.65     1:51    3.38
   18       5:15    9.77     1:00    1.87    45:44   85.17     1:42    3.19
   19       5:14    6.38     1:04    1.30  1:13:43   90.01     1:54    2.31
   20       5:01    6.69     0:49    1.08  1:07:23   89.85     1:47    2.38
   21       5:02    6.69     1:06    1.47  1:05:11   86.57     3:58    5.27
   22       4:58    6.23     1:08    1.42  1:10:33   88.19     3:24    4.26
   23       5:09    5.96     1:12    1.40  1:16:46   88.85     3:16    3.79
   24       4:52    6.32     0:00    0.00  1:09:09   89.68     3:05    4.00
Total       5:22    9.27     0:52    1.38    54:34   86.45     1:57    3.03

Time* = Normalized time = Actual Cost of Operation / cpu charge
       Format Conversion            Input           Search           Output
Issue      Time*       %    Time*       %    Time*       %    Time*       %

    1       5:00    5.85     0:00    0.00  1:16:59   89.72     3:48    4.43
    2       5:14    5.14     1:10    1.15  1:31:48   90.26     3:29    3.43
    3       5:14    6.06     1:04    1.23  1:17:02   89.15     3:05    3.56
    4       5:02    5.25     1:06    1.15  1:26:46   90.39     3:06    3.23
    5       5:54    7.48     0:54    1.14  1:10:04   88.80     2:02    2.58
    6       5:55    7.47     0:40    0.84  1:10:27   88.96     2:10    2.73
    7       6:10    7.47     0:56    1.14  1:13:25   88.99     1:59    2.40
    8       5:56    7.28     0:00    0.00  1:13:28   90.24     2:01    2.48
    9       6:13    7.37     0:59    1.16  1:14:32   88.42     2:34    3.05
   10       5:59    6.40     0:00    0.00  1:24:44   90.81     2:36    2.79
   11       6:42    6.12     1:02    0.95  1:36:48   88.40     4:57    4.52
   12       6:31    5.39     0:00    0.00  1:48:15   89.53     6:09    5.09
   13       6:03    5.93     1:18    1.27  1:28:37   86.88     6:02    5.92
   14       6:19    5.43     0:00    0.00  1:43:58   89.32     6:06    5.24
   15
   16       6:34    5.89     0:50    0.75  1:38:35   88.34     5:19    4.76
   17       6:21    6.37     1:06    1.11  1:26:59   87.33     5:10    5.18
   18       6:36    6.06     0:00    0.00  1:37:12   89.26     5:05    4.67
   19       5:23    7.50     0:00    0.00  1:01:45   86.12     4:34    6.37
   20       6:40    6.19     0:00    0.00  1:35:47   88.93     5:15    4.88
   21       6:39    5.89     1:13    1.08  1:39:28   88.18     5:29    4.86
   22       6:09    5.63     1:09    1.05  1:36:19   88.20     5:29    5.02
   23       6:06    6.04     0:00    0.00  1:29:54   88.93     5:05    5.02
   24       6:21    6.20     1:04    1.04  1:30:31   88.48     4:23    4.28
Total       6:03    6.28     0:38    0.65  1:26:41   88.84     4:11    4.20

Time* = Normalized time = Actual Cost of Operation / cpu charge
Issue Format Conversion             Input            Search            Output
          Time*       %     Time*       %     Time*       %     Time*       %

    1      7:01   43.35      0:27    2.79      7:14   44.68      1:29    9.18
    2     17:44   47.30      6:14   16.60      4:48   12.80      8:44    23.3
    3      9:12   40.34      0:33    2.42     10:17   45.11      2:46   12.12
    6      9:28   38.00      0:00    0.00     12:22   49.67      3:04   12.33
    7      9:31   34.50      0:34    2.05     13:40   49.54      3:50   13.90
    8      9:33   33.50      0:36    2.13     14:25   50.59      3:56   13.78
    9      9:45   31.53      0:00    0.00     16:45   54.19      4:25   14.27
   10      9:22   24.97      0:44    1.96     20:36   54.94      6:48   18.14
   11      9:34   31.25      0:38    2.06     15:58   52.20      4:26   14.49
   12      9:03   11.47      1:31    1.93     59:08   74.95      9:12   11.65

Total             33.62              3.19             48.87             14.32

Time* = Normalized Time = (Actual Cost of Operation) / (CPU charge)

Data for issues 4-5 do not exist
Issue Format Conversion             Input            Search            Output
          Time*       %     Time*       %     Time*       %     Time*       %

    1      7:53   17.39      1:16    2.81     30:31   67.37      5:38   12.43
    2      4:41   10.46      1:02    2.32     34:16   76.67      4:43   10.54
    3      5:41   12.88      1:33    3.53     31:45   72.00      5:07   11.59
    4      5:41   10.83      1:18    2.47     39:11   74.65      6:20   12.05
    5      7:47   12.36      1:01    1.62     46:59   74.57      7:13   11.45

Total      6:10   12.78      1:14    2.55     36:32   73.05      5:48   11.61

Time* = Normalized Time = (Actual Cost of Operation) / (CPU charge)

Table 10-13
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Figure 10-16 

PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE 
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Figure 10-17 

PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE
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Figure 10-18 

PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE
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Figure 10-20 

PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE
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Figure 10-22 

PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE 
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Figure 10-23 

PERCENT OF COMPUTER TIME PER COMPUTER PROGRAM VS. ISSUE 
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The cost of running the programs is naturally dependent on 
the size and number of input profiles, the data base, the num- 
ber of terms, the frequency with which the terms appear in the 
issue searched, the number of citations in the issue searched, 
the number of near hits (i.e., citations that matched profile 
terms but were subsequently disqualified on the basis of logic 
or weights etc.), the number of hits obtained by the profiles, 
and the number of citations printed. 

The two most significant determinants of cost are the size
of the data base and the number of profile terms. CSC has
developed a formula whereby, given the number of profile terms and
the number of citations in a data base, we can predict the cost
of the run to within 10%. The cost for searching one profile
term against one CA Condensates citation is $1.05 X 10"^ based 
on a total term list of 500-5000 words. The CSC Constant Com- 
puter Cost Factor is: 

$1.05 X 10"^/profile-term/citation. 

The search portion of the system is the prime determinant-- 
other factors such as number of hits, complexity of logic, 
number of hits printed, etc. account for the 10% variation. 
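
The relationship described above can be sketched as a simple product of terms, citations, and a per-term-per-citation cost factor. This is a minimal illustration, not the CSC program itself: the function names are hypothetical, and the factor used in the example (4.89 x 10^-5 dollars) is the issue 1 Cost/Term/Cit. entry from Table 10-15 rather than the CSC constant quoted in the text.

```python
def predict_cost_per_profile(terms_per_profile, citations, cost_factor):
    """Estimate the search cost for one profile, in dollars."""
    return terms_per_profile * citations * cost_factor


def within_tolerance(predicted, actual, tolerance=0.10):
    """Return True when the prediction is within `tolerance` of the actual cost."""
    return abs(predicted - actual) <= tolerance * actual


# Illustrative inputs taken from issue 1 of Table 10-15:
# 21.2 terms per profile, 3771 citations, 4.89 x 10^-5 dollars per term per citation.
estimate = predict_cost_per_profile(21.2, 3771, 4.89e-5)
print(f"estimated cost per profile: ${estimate:.2f}")
print("within 10% of the tabulated $3.91:", within_tolerance(estimate, 3.91))
```

With these inputs the estimate comes to about $3.91, which agrees with the tabulated cost per profile for that issue.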

10.4.2 CSC Time and Cost Summary Sheet 

The CSC Time and Cost Summary Sheet is prepared for each
issue of each data base processed. It is color coded for CA
(white), BA (green), and EI (yellow). Samples are attached as
Figures 10-25, 10-26, and 10-27. The time figures recorded on
these sheets are obtained from the computer printout listings
for each of the programs. A sample of the printout listing of
the INPUTR program for a production run of CA Condensates is
attached as Figure 10-28. Percentage of total time and cost
figures are calculated. The cost figure is obtained from CPU
time, core size, and current computer rates. The statistics
calculated on the following page require input from the
computer-generated Production Run Summary--Computer Search Center,
which is discussed below. The Time and Cost Summary contains the
following:



• date, data base, volume and issue 

• time in seconds and in hours, minutes and seconds 
for each program (and any reruns that are necessary) 

• percentage of total time used by each program 

• cost per program 

• total time and cost for all programs and reruns 

• time and cost per profile 

• time and cost per term (profile term) 

• time and cost per hit 

• time and cost per term/per citation 

• cost average to date (begun anew with each volume)
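
The items listed above can be derived from a handful of recorded times and counts. The sketch below is a hypothetical illustration: the stage times are borrowed from the issue 1 row of Table 10-13, while the flat `dollars_per_second` rate and the profile, term, hit, and citation counts are invented for the example. The actual cost figure also depends on core size, which is not modeled here.

```python
def hms(seconds):
    """Format a whole number of seconds as H:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def summarize(stage_seconds, dollars_per_second, profiles, terms, hits, citations):
    """Build the recorded and calculated statistics for one issue."""
    total_sec = sum(stage_seconds.values())
    total_cost = total_sec * dollars_per_second
    rows = {
        name: {
            "sec": sec,
            "hms": hms(sec),
            "pct": round(100 * sec / total_sec, 2),
            "cost": round(sec * dollars_per_second, 2),
        }
        for name, sec in stage_seconds.items()
    }
    # Divisors for the "time and cost per ..." lines of the sheet.
    per_unit = {
        "profile": profiles,
        "term": terms,
        "hit": hits,
        "term_per_citation": terms * citations,
    }
    calculated = {
        unit: {"sec": total_sec / count, "cost": total_cost / count}
        for unit, count in per_unit.items()
    }
    return {"rows": rows, "total_sec": total_sec,
            "total_cost": total_cost, "calculated": calculated}


# Stage times (seconds) from issue 1 of Table 10-13; all other inputs are hypothetical.
stages = {"format_conversion": 473, "input": 76, "search": 1831, "output": 338}
sheet = summarize(stages, dollars_per_second=0.05, profiles=100,
                  terms=2500, hits=3000, citations=4000)
for name, row in sheet["rows"].items():
    print(f"{name:<18}{row['sec']:>6}{row['hms']:>10}{row['pct']:>8.2f}{row['cost']:>9.2f}")
print("total", hms(sheet["total_sec"]), f"${sheet['total_cost']:.2f}")
```

The "cost average to date" line would be a running mean of `total_cost` over the issues of the current volume, restarted when a new volume begins.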
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Statistics Recorded 

Date of Run
Tape Service
Volume - Issue

PROGRAM      SEC.      HH:MM:SS      %      COST      RERUNS
                                                      NO.   TOTAL TIME

CACOPY
FORCON
DKEDIT
INPUTR
SEARCH
CACARD
STIXA
OCP
PRINT
PRILIB

TOTAL                             100.00    $

Additional Cost due to Reruns               $

Total Cost                                  $

Statistics Calculated

Time & Cost per Profile                sec.      $
Time & Cost per Term                   sec.      $
Time & Cost per Hit                    sec.      $
Time & Cost per Term per Citation      sec.      $

Cost Average to Date

Figure 10-25

COMPUTER SEARCH CENTER TIME AND COST SUMMARY SHEET
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Statistics Recorded 

Date of Run
Tape Service
Volume - Issue

PROGRAM      SEC.      HH:MM:SS      %      COST      RERUNS
                                                      NO.   TOTAL TIME

BACOPY
FORBAP
DKEDIT
INPUTR
BASRCH
BAPFORM
STIXA
OCP
PRINT
PRILIB

TOTAL                             100.00    $

Additional Cost due to Reruns               $

Total Cost                                  $

Statistics Calculated

Time & Cost per Profile                sec.      $
Time & Cost per Term                   sec.      $
Time & Cost per Hit                    sec.      $
Time & Cost per Term per Citation      sec.      $

Cost Average to Date

Figure 10-26

COMPUTER SEARCH CENTER TIME AND COST SUMMARY SHEET
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Statistics Recorded 



Date of Run
Tape Service          EI COMPENDEX
Volume - Issue

PROGRAM      SEC.      HH:MM:SS      %      COST      RERUNS
                                                      NO.   TOTAL TIME

EICOPY
EICON
DKEDIT
INPUTR
EISRCH
EICARD
STIXA
EIOCP
PRINT
PRILIB

TOTAL                             100.00    $

Additional Cost due to Reruns               $

Total Cost                                  $

Statistics Calculated

Time & Cost per Profile                sec.      $
Time & Cost per Term                   sec.      $
Time & Cost per Hit                    sec.      $
Time & Cost per Term per Citation      sec.      $

Cost Average to Date

Figure 10-27

COMPUTER SEARCH CENTER TIME AND COST SUMMARY SHEET
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TIME RECORD FROM CA PRODUCTION INPUTR RUN 



10.4.3 Production Run Summary--CSC

The Production Run Summary--CSC is a machine-generated
summary of statistics for each issue of each data base.
It includes three sections. The first page (see Figure 10-29)
contains:

• date, data base, volume, issue and number of citations 

• number of profiles and number of in-house profiles 

• number and mean of input terms

• number and mean of aggregated terms 

• percent term reduction by aggregation 

• number of hits (in-house, others and total) 
and means, both recorded and printed 

• number and mean of unique citations retrieved 

• number and mean of cards printed 

• hits recorded and hits printed per citation retrieved 

• range and median of hits generated 

• range of hits printed 

• number of profiles getting no hits 

The second page (more than one page is printed if necessary)
gives the distribution of hits by profile (see Figure 10-30).
The third page (again, more than one page is printed if
necessary) gives the distribution of hits recorded and printed
by corporate code (see Figure 10-31).
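
Several of the first-page statistics follow directly from a few counts. The sketch below is an illustrative reconstruction, not the CSC program: the function names are hypothetical, the term counts are the ones printed in Figure 10-29 (3458 input terms, 2595 aggregated terms, 129 profiles), and the per-profile hit list is invented for the example.

```python
from statistics import median


def term_reduction(input_terms, aggregated_terms, profiles):
    """Means per profile and percent reduction achieved by aggregation."""
    return {
        "mean_input_terms": input_terms / profiles,
        "mean_aggregated_terms": aggregated_terms / profiles,
        "percent_reduction": 100 * (input_terms - aggregated_terms) / input_terms,
    }


def hit_distribution(hits_per_profile):
    """Total, mean, range, median, and zero-hit count for a run."""
    return {
        "total_hits": sum(hits_per_profile),
        "mean_hits": sum(hits_per_profile) / len(hits_per_profile),
        "range_of_hits": (min(hits_per_profile), max(hits_per_profile)),
        "median_hits": median(hits_per_profile),
        "zero_hit_profiles": sum(1 for h in hits_per_profile if h == 0),
    }


# Term counts as printed in Figure 10-29.
terms = term_reduction(input_terms=3458, aggregated_terms=2595, profiles=129)
print({name: round(value, 1) for name, value in terms.items()})

# Hypothetical hit counts for six profiles.
hits = hit_distribution([0, 0, 5, 19, 40, 383])
print(hits)
```

Rounded to one decimal, the term figures reproduce the 26.8 and 20.1 terms per profile and the 25.0 percent reduction shown in the figure.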

10.4.4 Profile Term Statistics 

Another listing prepared for each issue of each data
base is shown in Figure 10-32. It gives statistics on term
processing for all the terms in all the profiles. It includes:

• number of input terms 

• number of unique terms 

• number of Least Common Bigrams (LCB's) found in 
the terms 
PRODUCTION RUN SUMMARY                    COMPUTER SEARCH CENTER

DATE OF RUN                               JUNE 17, 1972
SERVICE, VOLUME, ISSUE                    CA CONDENSATES 76:25
CITATIONS ON TAPE                         5541

PROFILES IN RUN                            129   ( 31 IN-HOUSE)
NUMBER OF INPUT TERMS                     3458   (26.8/PROFILE)
NUMBER OF AGGREGATED TERMS                2595   (20.1/PROFILE)
PERCENT REDUCTION BY AGGREGATION          25.00

HITS RECORDED          IN-HOUSE            677   (21.8/PROFILE)
                       OTHERS             3962   (40.4/PROFILE)
                       TOTAL              4639   (35.9/PROFILE)

HITS PRINTED           IN-HOUSE            677   (21.8/PROFILE)
                       OTHERS             3711   (37.8/PROFILE)
                       TOTAL              4388   (34.0/PROFILE)

UNIQUE CITATIONS RETRIEVED                2819   (21.8/PROFILE)
CARDS PRINTED                             4646   (36.0/PROFILE)
HITS RECORDED/CITATION RETRIEVED          1.64
HITS PRINTED/CITATION RETRIEVED           1.55

RANGE OF HITS                             0 - 383
MEDIAN                                    19.0
RANGE OF PRINTS                           0 - 369
NUMBER OF ZERO-HIT PROFILES               11

Figure 10-29

PRODUCTION RUN SUMMARY -- OVERALL SUMMARY
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PRODUCTION RUN SUMMARY 



COMPUTER SEARCH CENTER

JUNE 17, 1972          CA CONDENSATES 76:25

PROFILE HIT DISTRIBUTION

Figure 10-30

PRODUCTION RUN SUMMARY -- PROFILE HIT DISTRIBUTION

PRODUCTION RUN SUMMARY COMPUTER SEARCH CENTER 



JUNE 17, 1972 CA CONDENSATES 76:25 
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• mean frequency of LCB's in the terms 

• standard deviation of term LCB frequency 

• mean frequency of all possible LCB's 

• mean group size (number of terms sorted under the 
average LCB) 

• standard deviation of group size 
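
The statistics in this listing can be illustrated with a small sketch that groups terms under their least common bigram. The report does not spell out the details of the LCB technique here, so this is only an approximation under stated assumptions: bigram frequencies are counted over the profile terms themselves (the production system may well have used frequencies from the data base), ties go to the first rarest bigram, and all names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def bigrams(term):
    """Return the overlapping two-character sequences in a term."""
    return [term[i:i + 2] for i in range(len(term) - 1)]


def term_statistics(input_terms):
    """Group unique terms under their least common bigram and summarize."""
    unique_terms = sorted({t for t in input_terms if len(t) >= 2})
    # Assumption: bigram frequencies are taken from the term list itself.
    freq = Counter(b for term in unique_terms for b in bigrams(term))
    groups = defaultdict(list)
    for term in unique_terms:
        lcb = min(bigrams(term), key=lambda b: freq[b])  # first rarest bigram wins ties
        groups[lcb].append(term)
    sizes = [len(members) for members in groups.values()]
    lcb_freqs = [freq[lcb] for lcb in groups]
    return {
        "input_terms": len(input_terms),
        "unique_terms": len(unique_terms),
        "lcbs": len(groups),
        "mean_lcb_frequency": mean(lcb_freqs),
        "sd_lcb_frequency": pstdev(lcb_freqs),
        "mean_group_size": mean(sizes),
        "sd_group_size": pstdev(sizes),
        "groups": dict(groups),
    }


# Hypothetical profile terms; the duplicate shows input versus unique counts.
stats = term_statistics(["POLYMER", "POLYMERS", "OXIDE", "OXIDATION", "POLYMER"])
for name, value in stats.items():
    print(name, value)
```

A small mean group size means few terms have to be compared once a bigram is matched, which is presumably why group size is tracked alongside the bigram frequencies.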



10.4.5 Profile Term Frequency per Issue 

The Profile Term Frequency per Issue list provides, for each 
term in each profile, the number of times that term appeared in the 
issue of the data base that was searched. (See Figure 10-33) 
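
A frequency list of this kind can be sketched as a word count over the issue followed by a lookup for each profile term. The matching rule below (exact, case-insensitive match on whole words) is an assumption made for illustration, and the profile identifiers and citation texts are invented.

```python
import re
from collections import Counter


def term_frequencies(profiles, citations):
    """Count how often each profile term occurs as a word in the issue."""
    words = Counter(
        word
        for text in citations
        for word in re.findall(r"[A-Z0-9]+", text.upper())
    )
    return {
        profile_id: {term: words[term.upper()] for term in terms}
        for profile_id, terms in profiles.items()
    }


# Hypothetical issue text and profiles.
citations = [
    "Oxidation of polymer films",
    "Polymer degradation by oxidation",
    "Catalyst supports",
]
profiles = {"P001": ["POLYMER", "OXIDATION"], "P002": ["ZEOLITE"]}
frequencies = term_frequencies(profiles, citations)
for profile_id, counts in frequencies.items():
    print(profile_id, counts)
```

Terms that never occur in the issue are reported with a count of zero, which makes unproductive terms easy to spot.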

10.4.6 Profile Term and Hit, Cost Data-Summaries 

Data and averages are generated for each production search 
of each issue of each data base regarding profiles, terms, 
citations and hits. The following statistics prepared for each 
issue are summarized on Tables 10-14 through 10-24. 

• number of terms per profile

• aggregation ratio for profile terms

• total number of citations on the data base

• number of citations retrieved by profiles in the run

• average number of citations retrieved (hits) per profile

• average number of profiles for which a retrieved citation
was a hit

• average number of hits per profile normalized to the
average number of citations per issue based on the
complete volume

• computer cost per profile

• computer cost per profile averaged to date within the volume

• computer cost per term

• computer cost per term averaged to date within the volume

• computer cost per hit

• computer cost per hit averaged to date within the volume

• computer cost per profile term per citation
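
Two of these statistics, the to-date averages and the normalized hits per profile, can be sketched as follows. The function names are hypothetical. The figures are taken from Table 10-15, and normalizing odd-numbered issues against the mean size of the odd-numbered issues only is an inference from the separate Odd and Even columns in the tables, not something the text states.

```python
from itertools import accumulate


def running_average(values):
    """Average to date after each issue of the volume."""
    return [total / n for n, total in enumerate(accumulate(values), start=1)]


def normalized_hits(hits_per_profile, issue_citations, volume_citations):
    """Scale hits per profile to the mean issue size of the volume."""
    mean_citations = sum(volume_citations) / len(volume_citations)
    return hits_per_profile * mean_citations / issue_citations


# Cost per profile for issues 1-3 of Table 10-15.
averages = running_average([3.91, 4.42, 3.69])
print([round(a, 2) for a in averages])

# Citation totals of the odd-numbered issues in Table 10-15 (assumed grouping).
odd_issue_citations = [3771, 3969, 3616, 3958, 5438, 5121, 4707,
                       4915, 4820, 4629, 4924, 4467, 4702]
norm_issue_1 = normalized_hits(15.7, 3771, odd_issue_citations)
print(f"normalized hits per profile, issue 1: {norm_issue_1:.2f}")
```

Under these assumptions the sketch gives roughly 18.9 normalized hits per profile for issue 1 and a three-issue average cost of about 4.01, both close to the tabulated values.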
          Terms           Cits.                   Hits               Cost/Profile
Issue    Agg.     Per   Total    Ret.     Per Hit/Ret.   Norm./Prof.      Issue
        Ratio Profile                 Profile    Cit.     Odd    Even

    9    14.0    21.8    3955       -     8.4       -   10.11            7.39
   10    17.9    22.0    6249       -    16.5       -           11.53   10.22
   11    17.8    22.0    4884       -    11.3       -   10.10           10.24
   12    18.0    22.2    5958       -    12.7       -            9.26   11.32
   13    16.5    21.3    5272       -    13.2       -   10.91           10.32
   14    17.0    22.2    5465       -    11.1       -            8.83   11.50
   15    23.0    20.2    3704       -    12.2       -   14.19            5.44
   16    16.4    21.3    5245   1149*    10.6    1.30            8.99    4.33
   17    14.3    19.4    4589   1785*    17.1    1.41   16.27            4.63
   18    18.7    21.2    5697   1325*    12.3    1.24            9.39    4.40
   19    21.6    19.7    4444   1741*    17.3    1.43   16.97            4.41
   20    18.7    21.4    6246   1359*    12.0    1.18            8.39    5.10
   21    17.1    20.9    4099   1528*    16.1    1.61   17.10            4.65
   22    19.9    22.0    4287   1524*    12.3    1.16           12.48    6.24
   23    23.2    21.1    4301   1531*    17.6    2.36   17.82            4.19

*Estimated

(Data for issue nos. 1-8 and 24-26 do not exist.)
                   Hits                Cost/Profile   Cost/Term    Cost/Hit  Cost/Term/Cit.
                                                                               (x 10^-5)
Issue     Per Hit/Ret.  Norm./Prof.    Issue   Avg. Issue  Avg. Issue  Avg.  Issue   Avg.
      Profile     Cit.    Odd   Even

    9     8.4        -  10.11           7.39   7.39   .34   .34   .99   .89   9.44   9.44
   10    16.5        -         11.53   10.22   8.81   .46   .40   .62   .76   7.42   8.43
   11    11.3        -  10.10          10.24   9.28   .47   .42   .91   .81   9.53   8.80
   12    12.7        -          9.26   11.32   9.79   .51   .44   .90   .83   8.58   8.74
   13    13.2        -  10.91          10.32   9.90   .48   .45   .78   .82   9.19   8.83
   14    11.1        -          8.83   11.50  10.17   .48   .45   .96   .84   8.73   8.82
   15    12.2        -  14.19           5.44   9.49   .27   .43   .45   .79   7.28   8.60
   16    10.6     1.30          8.99    4.33   8.85   .20   .40   .41   .74   3.82   8.00
   17    17.1     1.41  16.27           4.63   8.37   .24   .38   .27   .69   5.19   7.69
   18    12.3     1.24          9.39    4.40   7.98   .21   .37   .36   .66   3.65   7.28
   19    17.3     1.43  16.97           4.41   7.65   .22   .35   .26   .62   5.02   7.08
   20    12.0     1.18          8.39    5.10   7.44   .24   .34   .42   .60   3.79   6.80
   21    16.1     1.61  17.10           4.65   7.23   .22   .33   .29   .58   5.41   6.70
   22    12.3     1.16         12.48    6.24   7.16   .28   .33   .50   .57   6.62   6.69
   23    17.6     2.36  17.82           4.19   6.96   .20   .32   .24   .55   4.60   6.55

(Data for issue nos. 1-8 and 24-26 do not exist.)
          Terms           Cits.                   Hits               Cost/Profile
Issue    Agg.     Per   Total    Ret.     Per Hit/Ret.   Norm./Prof.      Issue    Avg.
        Ratio Profile                 Profile    Cit.     Odd    Even

    1    23.6    21.2    3771    1462    15.7    1.62   18.90            3.91    3.91
    2    25.1    22.0    5296    1364    13.6    1.45           15.86    4.42    4.17
    3    24.9    21.4    3969    1757    18.3    1.67   20.92            3.69    4.01
    4    25.7    22.7    5254    1438    13.5    1.38           15.86    4.64    4.17
    5    24.9    21.6    3616    1451    14.2    1.54   17.83            3.38    4.01
    6    28.5    23.9    6310    2177    21.0    1.52           20.59    6.51    4.44
    7    27.6    21.9    3958    1689    16.4    1.61   18.85            3.78    4.33
    8    29.5    23.5    6468    1858    17.2    1.48           16.40    6.33    4.58
    9    27.6    21.7    5438    2203    21.9    1.61   18.33            5.01    4.62
   10    28.6    24.7    6629    1878    17.9    1.39           16.70    6.92    4.86
   11    26.9    23.6    5121    2088    21.7    1.56   19.26            5.42    4.91
   12    28.7    25.7    6650    2006    18.0    1.35           16.72    7.59    5.13
   13    26.5    23.9    4707    1903    21.7    1.72   20.97            4.98    5.12
   14    28.3    26.0    7291    2105    19.0    1.40           16.05    8.33    5.35
   15    26.7    23.9    4915    2012    21.3    1.66   19.66            5.27    5.34
   16    32.0    27.4    6572    1964    17.9    1.44           16.63    8.33    5.53
   17    26.2    24.1    4820    1995    20.7    1.63   19.47            5.02    5.50
   18    27.8    25.6    5604    1443    13.3    1.41           14.70    5.77    5.52
   19    26.3    23.8    4629    1832    17.9    1.56   17.54            4.66    5.47
   20    28.0    25.5    5743    1641    15.3    1.47           16.49    5.75    5.48
   21    26.8    26.4    4924    2117    25.1    1.61   23.13            3.89    5.40
   22    27.4    27.8    5644    1485    16.7    1.40           18.25    3.35    5.32
   23    26.8    26.7    4467    1813    23.0    1.59   23.45            2.87    5.21
   24    27.2    28.1    6549    1571    20.8    1.34           19.59    4.36    5.17
   25    26.3    27.4    4702    1858    25.4    1.56   24.48            2.80    5.08
   26    26.8    28.9    6287    1490    20.3    1.32           26.70    4.68    5.06

Table 10-15

PROFILE TERM, HIT, COST DATA VS. ISSUE
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CONDENSATES 


VOLUME 
                   Hits                Cost/Profile   Cost/Term    Cost/Hit  Cost/Term/Cit.
                                                                               (x 10^-5)
Issue     Per Hit/Ret.  Norm./Prof.    Issue   Avg. Issue  Avg. Issue  Avg.  Issue   Avg.
      Profile     Cit.    Odd   Even

    1    15.7     1.62  18.90           3.91   3.91   .18   .18   .25   .25   4.89   4.89
    2    13.6     1.45         15.86    4.42   4.17   .20   .19   .33   .29   3.80   4.35
    3    18.3     1.67  20.92           3.69   4.01   .17   .18   .20   .26   4.36   4.35
    4    13.5     1.38         15.86    4.64   4.17   .20   .18   .34   .28   3.89   4.23
    5    14.2     1.54  17.83           3.38   4.01   .16   .18   .24   .27   4.32   4.25
    6    21.0     1.52         20.59    6.51   4.44   .27   .20   .31   .28   4.31   4.26
    7    16.4     1.61  18.85           3.78   4.33   .17   .19   .23   .27   4.35   4.27
    8    17.2     1.48         16.40    6.33   4.58   .27   .20   .37   .28   4.17   4.26
    9    21.9     1.61  18.33           5.01   4.62   .23   .21   .23   .28   4.25   4.26
   10    17.9     1.39         16.70    6.92   4.86   .28   .21   .39   .29   4.23   4.26
   11    21.7     1.56  19.26           5.42   4.91   .23   .22   .25   .29   4.50   4.28
   12    18.0     1.35         16.72    7.59   5.13   .30   .22   .42   .30   4.44   4.29
   13    21.7     1.72  20.97           4.98   5.12   .21   .22   .23   .29   4.42   4.30
   14    19.0     1.40         16.05    8.33   5.35   .32   .23   .44   .30   4.40   4.31
   15    21.3     1.66  19.66           5.27   5.34   .22   .23   .25   .30   4.49   4.32
   16    17.9     1.44         16.63    8.33   5.53   .30   .23   .47   .31   4.63   4.34
   17    20.7     1.63  19.47           5.02   5.50   .21   .23   .24   .30   4.32   4.34
   18    13.3     1.41         14.70    5.77   5.52   .23   .23   .43   .31   4.02   4.32
   19    17.9     1.56  17.54           4.66   5.47   .20   .23   .26   .31   4.22   4.32
   20    15.3     1.47         16.49    5.75   5.48   .23   .23   .37   .31   3.02   4.30
   21    25.1     1.61  23.13           3.89   5.40   .15   .22   .16   .30   2.99   4.23
   22    16.7     1.40         18.25    3.35   5.32   .12   .22   .20   .30   2.14   4.14
   23    23.0     1.59  23.45           2.87   5.21   .11   .21   .12   .29   2.41   4.06
   24    20.8     1.34         19.59    4.36   5.17   .16   .21   .21   .29   2.37   3.99
   25    25.4     1.56  24.48           2.80   5.08   .10   .21   .11   .28   2.18   3.92
   26    20.3     1.32         26.70    4.68   5.06   .16   .21   .23   .28   2.57   2.87
          Terms           Cits.                   Hits               Cost/Profile
Issue    Agg.     Per   Total    Ret.     Per Hit/Ret.   Norm./Prof.      Issue    Avg.
        Ratio Profile                 Profile    Cit.     Odd    Even

    1   25.66   27.88    4182    1616   22.13    1.51   24.05            2.45    2.45
    2   27.28   30.05    6321    1908   26.15    1.39           24.79    5.22    3.84
    3   25.20   28.70    4731    2016   28.70    1.58   27.54            4.13    3.93
    4   25.80   30.30    5855    1716   26.20    1.41           26.95    5.00    4.20
    5   26.80   30.91    4734    1963   30.88    1.56   29.66            4.14    4.19
    6   25.36   30.29    5523    1419   21.51    1.38           23.34    3.82    4.14
    7   26.02   32.32    4351    1740   28.14    1.50   29.41            3.80    4.08
    8   26.32   31.26    5887    1438   22.75    1.38           23.16    3.90    4.06
    9   25.36   32.47    4190    1705   27.16    1.50   29.12            3.43    3.99
   10   26.54   32.22    6275    1502   23.84    1.38           22.77    4.08    4.00
   11   31.66   33.50    4306    1862   31.83    1.64   33.61            3.54    3.96
   12   28.70   32.47    6016    1493   23.08    1.41           22.99    3.84    3.95
   13   30.78   33.74    4425    1874   33.48    1.63   34.40            4.75    4.04
   14   27.04   33.01    5974    1594   28.35    1.44           28.43    4.29    4.06
   15   31.59   34.21    4737    2019   35.25    1.61   38.30            3.82    4.05
   16   27.14   33.63    6056    1963   38.58    1.55           38.17    4.88    4.09
   17   21.53   36.05    4487    2191   40.74    1.56   41.28            4.61    4.12
   18   22.83   33.99    6253    1994   38.04    1.51           36.46    4.78    4.16
   19   21.93   35.16    4774    2175   36.50    1.44   34.77            4.76    4.19
   20   23.41   32.48    6039    1901   31.19    1.44           30.95    4.61    4.21
   21   22.07   34.53    4880    2175   35.68    1.48   33.24            5.06    4.23
   22   24.40   33.70    6042    1824   31.13    1.48           30.89    4.46    4.24
   23   22.70   33.30    4717    2083   32.58    1.51   31.41            4.48    4.25
   24   24.50   32.13    5800    1676   26.35    1.47           27.22    4.07    4.24
   25   22.70   33.46    4595    2143   33.98    1.53   33.64            4.37    4.25
   26   24.50   32.27    5862    1710   25.89    1.42           26.47    4.23    4.25
                   Hits                Cost/Profile   Cost/Term    Cost/Hit  Cost/Term/Cit.
                                                                               (x 10^-5)
Issue     Per Hit/Ret.  Norm./Prof.    Issue   Avg. Issue  Avg. Issue  Avg.  Issue   Avg.
      Profile     Cit.    Odd   Even

    1   22.13     1.51  24.05           2.45   2.45   .09   .13   .11   .16   2.10   2.10
    2   26.15     1.39         24.79    5.22   3.84   .17   .13   .20   .16   2.68   2.39
    3   28.70     1.58  27.54           4.13   3.93   .14   .13   .14   .15   2.95   2.58
    4   26.20     1.41         26.95    5.00   4.20   .17   .14   .19   .16   2.81   2.63
    5   30.88     1.56  29.66           4.14   4.19   .13   .14   .13   .15   2.83   2.67
    6   21.51     1.38         23.34    3.82   4.14   .13   .14   .18   .16   2.29   2.61
    7   28.14     1.50  29.41           3.80   4.08   .12   .14   .14   .16   2.71   2.62
    8   22.75     1.38         23.16    3.90   4.06   .12   .13   .17   .16   2.12   2.56
    9   27.16     1.50  29.12           3.43   3.99   .11   .13   .13   .15   2.52   2.55
   10   23.84     1.38         22.77    4.08   4.00   .13   .13   .17   .16   2.02   2.50
   11   31.83     1.64  33.61           3.54   3.96   .11   .13   .11   .15   2.46   2.50
   12   23.08     1.41         22.99    3.84   3.95   .12   .13   .17   .15   1.96   2.45
   13   33.48     1.63  34.40           4.75   4.04   .14   .13   .14   .15   3.09   2.50
   14   28.35     1.44         28.43    4.29   4.06   .13   .12   .15   .15   2.18   2.48
   15   35.25     1.61  38.30           3.82   4.05   .11   .13   .11   .15   2.36   2.47
   16   38.58     1.55         38.17    4.88   4.09   .15   .13   .13   .15   2.40   2.47
   17   40.74     1.56  41.28           4.61   4.12   .13   .13   .11   .15   2.85   2.49
   18   38.04     1.51         36.46    4.78   4.16   .14   .13   .13   .15   2.25   2.48
   19   36.50     1.44  34.77           4.76   4.19   .14   .13   .13   .14   2.83   2.49
   20   31.19     1.44         30.95    4.61   4.21   .14   .13   .15   .14   2.35   2.49
   21   35.68     1.48  33.24           5.06   4.23   .15   .13   .14   .15   3.00   2.51
   22   31.13     1.48         30.89    4.46   4.24   .13   .13   .14   .14   2.19   2.50
   23   32.58     1.51  31.41           4.48   4.25   .13   .13   .14   .14   2.85   2.51
   24   26.35     1.47         27.22    4.07   4.24   .13   .13   .15   .14   2.18   2.50
   25   33.98     1.53  33.64           4.37   4.25   .13   .13   .13   .14   2.84   2.51
   26   25.89     1.42         26.47    4.23   4.25   .13   .13   .16   .14   2.24   2.48
          Terms           Cits.                   Hits               Cost/Profile
Issue    Agg.     Per   Total    Ret.     Per Hit/Ret.   Norm./Prof.      Issue    Avg.
        Ratio Profile                 Profile    Cit.     Odd    Even

    1    22.5    32.4    3858    1690    24.7    1.45   32.04            3.69    3.69
    2    24.1    30.9    5577    1759    26.2    1.43           31.09    4.11    3.90
    3    22.5    32.4    3961    1760    26.3    1.48   33.12            3.70    3.83
    4    24.1    29.8    5738    1735    23.2    1.42           26.82    3.94    3.86
    5    22.7    31.4    3957    1761    22.7    1.43   28.67            3.81    3.85
    6    24.9    30.2    4746    1409    20.1    1.45           28.09    3.39    3.77
    7    23.2    30.0    4087    1895    24.9    1.47   30.39            3.25    3.70
    8    28.9    29.1    6816    2435    31.5    1.45           36.99    4.18    3.76
    9    23.4    30.0    4475    2005    26.0    1.47   29.04            4.00    3.79
   10    25.5    28.8    5986    2090    28.1    1.45           27.31    4.48    3.86
   11    25.1    29.3    5618    2262    30.1    1.56   31.65            3.86    3.86
   12    28.2    29.1    6553    2044    28.8    1.62           29.12    4.29    3.90
   13    25.3    28.4    3667    1930    25.3    1.65   34.40            4.52    3.95
   14    28.2    28.1    7074    2540    32.4    1.67           30.32    4.77    4.01
   15    25.2    28.4    5174    2705    34.7    1.61   44.29            6.31    4.16
   16    28.5    27.9    6190    1983    23.6    1.57           25.25    4.52    4.18
   17    24.1    28.2    5458    1812    26.9    1.50   29.22            5.27    4.24
   18    25.1    27.9    6350    1855    21.7    1.48           16.57    4.18    4.24
   19    24.3    27.9    6243    3026    38.0    1.60   30.30            5.37    4.30
   20    25.4    27.6    6519    2023    21.5    1.43           15.22    4.21    4.29
   21    23.9    27.8    5458    2752    34.5    1.59   31.54            4.93    4.32
   22    25.3    27.8    7442    2565    28.7    1.48           25.53    4.85    4.34
   23    25.5    28.2    6151    2842    34.9    1.53   28.32            4.89    4.36
   24    27.4    29.2    8732    3154    38.6    1.58           29.30    5.96    4.43
   25    25.8    28.3    7531    3543    43.5    1.55   28.83            5.97    4.49
   26    28.9    28.7    5722    2837    32.7    1.60           25.10    5.28    4.52
                   Hits                Cost/Profile   Cost/Term    Cost/Hit  Cost/Term/Cit.
                                                                               (x 10^-5)
Issue     Per Hit/Ret.  Norm./Prof.    Issue   Avg. Issue  Avg. Issue  Avg.  Issue   Avg.
      Profile     Cit.    Odd   Even

    1    24.7     1.45  32.04           3.69   3.69   .11   .11   .15   .15   2.96   2.96
    2    26.2     1.43         31.09    4.11   3.90   .13   .12   .16   .16   2.38   2.67
    3    26.3     1.48  33.12           3.70   3.83   .11   .12   .14   .15   2.88   2.74
    4    23.2     1.42         26.82    3.94   3.86   .13   .12   .17   .16   2.30   2.63
    5    22.7     1.43  28.67           3.81   3.85   .12   .12   .17   .16   3.07   2.72
    6    20.1     1.45         28.09    3.39   3.77   .11   .12   .17   .16   2.36   2.66
    7    24.9     1.47  30.39           3.25   3.70   .11   .12   .13   .16   2.65   2.66
    8    31.5     1.45         36.99    4.18   3.76   .14   .12   .12   .15   2.33   2.62
    9    26.0     1.47  29.04           4.00   3.79   .13   .12   .15   .15   2.98   2.66
   10    28.1     1.45         27.31    4.48   3.86   .16   .13   .16   .15   2.29   2.62
   11    30.1     1.56  31.65           3.86   3.86   .13   .13   .13   .15   2.78   2.63
   12    28.8     1.62         29.12    4.29   3.90   .15   .13   .15   .15   2.25   2.60
   13    25.3     1.65  34.40           4.52   3.95   .16   .13   .18   .15   4.73   2.77
   14    32.4     1.67         30.32    4.77   4.01   .17   .13   .15   .15   2.39   2.74
   15    34.7     1.61  44.29           6.31   4.16   .22   .14   .18   .15   4.28   2.84
   16    23.6     1.57         25.25    4.52   4.18   .16   .14   .19   .16   2.61   2.83
   17    26.9     1.50  29.22           5.27   4.24   .19   .14   .16   .16   3.43   2.86
   18    21.7     1.48         16.57    4.18   4.24   .15   .14   .29   .16   2.56   2.85
   19    38.0     1.60  30.30           5.37   4.30   .19   .15   .14   .16   3.08   2.86
   20    21.5     1.43         15.22    4.21   4.29   .15   .15   .28   .17   2.33   2.83
   21    34.5     1.59  31.54           4.93   4.32   .18   .15   .14   .17   3.24   2.85
   22    28.7     1.48         25.53    4.85   4.34   .17   .15   .17   .17   2.34   2.83
   23    34.9     1.53  28.32           4.89   4.36   .17   .15   .14   .17   2.82   2.83
   24    38.6     1.58         29.30    5.96   4.43   .20   .15   .15   .17   2.33   2.81
   25    43.5     1.55  28.83           5.97   4.49   .21   .15   .14   .16   2.80   2.81
   26    32.7     1.60         25.10    5.28   4.52   .18   .16   .16   .16   2.14   2.78
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Terms Cits. 



Issue 


Agg. 

Ratio 


Per 

Profile 


Total 


Ret 


1 


24.9 


27.0 


6342 


3055 


2 


22.8 


25.9 


8363 


2846 


3 


24.9 


27.0 


6260 


3186 


4 


22.8 


25.9 


8720 


3271 


5 


21.6 


26.5 


6917 


3254 


6 


22.6 


25.0 


7964 


2847 


7 


21.4 


25.6 


4906 


2367 


8 


22.3 


24.4 


8856 


3099 


9 


22.0 


25.3 


5803 


2689 


10 


22.6 


24.2 


6870 


2358 


11 


21.9 


25.2 


6510 


2864 


12 


21.6 


24.0 


6694 


2320 


13 


20.9 


24.0 


4960 


2309 


14 


21.8 


23.5 


5461 


1789 


15 


21.8 


23.5 


4362 


1551 


16 


21.7 


23.8 


6048 


1985 


17 


20.9 


24.1 


5690 


2597 


18 


21.6 


23.8 


4989 


1685 


19 


22.1 


24.3 


3016 


1292 


20 


22.1 


23.5 


7438 


2692 


21 


21.0 


25.0 


4030 


1637 


22 


21.0 


24.1 


5681 


2548 


23 


20.4 


25.0 


4328 


1737 


24 


20.0 


26.4 


4120 


2551 


25 


20.4 


24.8 


4481 


1807 


26 


20.1 


24.6 


4159 


2369 



Hits Cost/Profil 



Per 

Profile 


Hit/Ret . 
Cit. 


Norm. /Prof . 
Odd Even 


Issue 


Avg 


37.7 


1.65 


30.81 


4.33 


4.3 


30.8 


1.45 


25.19 


4.00 


4.1 


39.8 


1.67 


32.93 


4.38 


4.2 


38.1 


1.56 


29.87 


4.36 


4.2 


43.0 


1.58 


32.15 


4.89 


4.3 


31.9 


1.52 


27.39 


3.95 


4.3 


29.3 


1.57 


30.91 


3.50 


4.2 


32.5 


1.48 


25.10 


4.18 


4.2 


31.5 


1.57 


28.07 


4.02 


4.1 


24.8 


1.52 


24.63 


3.26 


4.0 


33.7 


1.58 


28.42 


4.01 


4.0 


26.0 


1.52 


26.52 


3.17 


4.00 


26.8 


1.48 


27.95 


2.60 


3.90 


20.1 


1.54 


25.20 


1.99 


3.76 


14.9 


1.32 


17.76 


2.20 


3.66 


22.1 


1.50 


24.97 


2.74 


3.60 


29.2 


1.40 


26.56 


2.96 


3.56 


19.5 


1.53 


26.75 


2.34 


3.49 


15.3 


1.45 


26.39 


1.64 


3.40 


30.3 


1.55 


27.31 


3.18 


3.39 


20.9 


1.44 


26.81 


2.18 


3.33 


30.8 


1.45 


31.48 


3.11 


3.32 


22.5 


1.49 


26.95 


2.40 


3.28 


32.5 


1.44 


53.81 


3.22 


3.28 


22.8 


1.45 


26.32 


2.54 


3.25 


32.5 


1.58 


53.40 


3.01 


3.24 
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PROFILE TERM, HIT, COST DATA VS 



ISSUE 
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VOLUME 
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Hits 


Cost/Profile 


Cost/Term 


Cost/Hit 


Cost/Term/Cit. 


Per 

Profile 


Hit/Ret. 

Cit. 


Norm. /Prof. 
Odd Even 


Issue 


Avg. 


Issue 


Avg. 


Issue 


Avg. 


X 

Issue 


10' 5 

Avg. 


37.7 


1.65 


30.81 


4.33 


4.33 


.16 


.16 


.11 


.11 


2.53 


2.53 


30.8 


1.45 


25.19 


4.00 


4.17 


.15 


.15 


.13 


.12 


1.84 


2.19 


39.8 


1.67 


32.93 


4.38 


4.24 


.16 


.16 


.11 


.12 


2.59 


2.32 


38.1 


1.56 


29.87 


4.36 


4.27 


.17 


.16 


.11 


.12 


1.93 


2.22 


43.0 


1.58 


32.15 


4.89 


4.39 


.18 


.16 


.11 


.12 


2.66 


2.31 


31.9 


1.52 


27.39 


3.95 


4.32 


.16 


.16 


.12 


.12 


1.98 


2.26 


29.3 


1.57 


30.91 


3.50 


4.20 


.14 


.16 


.12 


.12 


2.78 


2.33 


32.5 


1.48 


25.10 


4.18 


4.20 


.17 


.16 


.13 


.12 


1.93 


2.28 


31.5 


1.57 


28.07 


4.02 


4.18 


.16 


.16 


.13 


.12 


2.74 


2.33 


24.8 


1.52 


24.63 


3.26 


4.09 


.13 


.16 


.13 


.12 


1.96 


2.29 


33.7 


1.58 


28.42 


4.01 


4.08 


.16 


.16 


.12 


.12 


2.60 


2.32 


26.0 


1.52 


26.52 


3.17 


4.00 


.13 


.16 


.12 


.12 


1.96 


2.29 


26.8 


1.48 


27.95 


2.60 


3.90 


.11 


.15 


.10 


.12 


2.18 


2.28 


20.1 


1.54 


25.20 


1.99 


3.76 


.08 


.15 


.10 


.12 


1.54 


2.23 


14.9 


1.32 


17.76 


2.20 


3.66 


.09 


.14 


.11 


.12 


2.00 


2.21 


22.1 


1.50 


24.97 


2.74 


3.60 


.11 


.14 


.12 


.12 


1.90 


2.20 


29.2 


1.40 


26.56 


2.96 


3.56 


.12 


.14 


.10 


.12 


2.15 


2.19 


19.5 


1.53 


26.75 


2.34 


3.49 


.10 


.14 


.12 


.12 


1.96 


2.18 


15.3 


1.45 


26.39 


1.64 


3.40 


.07 


.13 


.11 


.12 


2.23 


2.18 


30.3 


1.55 


27.81 


3.18 


3.39 


.14 


.13 


.10 


.12 


1.81 


2.16 


20.9 


1.44 


26.81 


2.18 


3.33 


.09 


.13 


.10 


.11 


2.15 


2.16 


30.8 


1.45 


31.48 


3.11 


3.32 


.13 


.13 


.10 


.11 


1.92 


2.15 


22.5 


1.49 


26.95 


2.40 


3.28 


.10 


.13 


.11 


.11 


2.21 


2.15 


32.5 


1.44 


53.81 


3.22 


3.28 


.13 


.13 


.10 


.11 


2.07 


2.15 


22.8 


1.45 


26.32 


2.54 


3.25 


.10 


.13 


.11 


.11 


2.27 


2.16 


32.5 


1.58 


53.40 


3.01 


3.24 


.12 


.13 


.16 


.11 


2.10 


2.15 
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Terms 



Cits. 



Hits 



Issue 


Agg. 

Ratio 


Per 

Profile 


Total 


Ret. 


Per 

Profile 


Hit/Ret. 

Cit. 


Norm. /Prof . 
Odd Even 


Issue 


Avg 


1 


20.1 


24.6 


4159 


1717 


22.3 


1.49 


28.34 


1.92 


1.92 


2 


23.1 


23.6 


4816 


2441 


30.7 


1.77 


44.77 


1.81 


1.87 


3 


23.0 


24.0 


6051 


1901 


20.8 


1.43 


18.18 


2.07 


1.94 


4 


23.2 


23.3 


6051 


2679 


30.9 


1.79 


35.92 


2.36 


2.05 


5 


23.1 


23.6 


4592 


1790 


18.8 


1.39 


21.60 


1.94 


2.03 


6 


24.2 


23.1 


4126 


2180 


24.1 


1.78 


41.11 


1.83 


2.0 


7 


25.0 


24.0 


7187 


1813 


19.9 


1.52 


25.41 


1.76 


1.9 


8 


24.9 


23.8 


7187 


3033 


33.9 


1.76 


33.15 


2.87 


2.0 


9 


25.3 


24.8 


4732 


2229 


25.7 


1.57 


28.70 


2.24 


2.1 


10 


28.1 


24.6 


7625 


2394 


42.3 


2.69 


38.98 


3.06 


2.2 


11 


26.7 


23.9 


5923 


2874 


30.9 


1.58 


27.50 


2.46 


2.2 


12 


27.8 


23.9 


7887 


3139 


40.2 


1.89 


35.80 


2.87 


2.2 


13 


26.5 


23.5 


4806 


2739 


32.2 


1.72 


35.32 


1.92 


2.2 


14 


27.6 


23.7 


8103 


3487 


46.3 


1.91 


40.19 


3.04 


2.3 


15 


28.5 


24.0 


5236 


2609 


26.9 


1.56 


27.15 


2.14 


2.2 


16 


28.4 


23.3 


8200 


3747 


47.9 


1.92 


41.07 


3.05 


2.34 


17 


27.6 


23.4 


5706 


3052 


32.3 


1.65 


29.84 


2.12 


2.3 


18 


29.2 


23.1 


7620 


3335 


40.9 


1.97 


37.76 


2.08 


2.3 


19 


27.7 


24.0 


5815 


2874 


31.0 


1.65 


28.10 


1.79 


2.2 


20 


28.7 


24.2 


5815 


3691 


49.7 


2.04 


45.36 


2.22 


2.29 


21 


27.9 


24.6 


6153 


3053 


34.3 


1.66 


28.36 


1.69 


2.26 


22 


28.3 


25.3 


7900 


3678 


47.7 


1.93 


42.46 


2.20 


2.26 


23 


25.8 


25.4 


5910 


2951 


34.0 


1.61 


30.38 


1.69 


2.24 


24 


27.8 


25.9 


7258 


3374 


47.2 


2.00 


45.68 


2.23 


2.24 


25 


25.0 


26.8 


5541 


2819 


21.8 


1.64 


34.19 


1.67 


2.22 


26 


27.8 


26.9 


7870 


4026 


51.2 


2.01 


51.90 


2.50 


2.23 
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PROFILE TERM, HIT, COST DATA VS 



ISSUE 



CHEMICAL ABSTRACTS CONDENSATES 


VOLUME 


76 
















Hits 


Cost/Profile 


Cost/Term 


Cost/Hit 


Cost/Term/Cit . 


Per 

Profile 


Hit/Ret. 

Cit. 


. Norm. /Prof. 
Odd Even 


Issue 


Avg. 


Issue 


Avg. 


Issue 


Avg, 


. Issue 


Avg. 


22.3 


1.49 


28.34 


1.92 


1.92 


.08 


.08 


.09 


.09 


1.87 


1.87 


30.7 


1.77 


44.77 


1.81 


1.87 


.08 


.08 


.06 


.08 


1.29 


1.58 


20.8 


1.43 


18.18 


2.07 


1.94 


.09 


.08 


.10 


.08 


1.79 


1.65 


30.9 


1.79 


35.92 


2.36 


2.05 


.10 


.09 


.07 


.08 


1.63 


1.65 


18.8 


1.39 


21.60 


1.94 


2.03 


.08 


.09 


.10 


.09 


1.78 


1.67 


24.1 


1.78 


41.11 


1.83 


2.00 


.08 


.09 


.08 


.08 


1.63 


1.67 


19.9 


1.52 


25.41 


1.76 


1.97 


.07 


.08 


.09 


.08 


1.77 


1.68 


33.9 


1.76 


33.15 


2.87 


2.08 


.12 


.09 


.10 


.09 


1.67 


1.68 


25.7 


1.57 


28.70 


2.24 


2.10 


.09 


.09 


.09 


.09 


1.91 


1.70 


42.3 


2.69 


38.98 


3.06 


2.20 


.12 


.09 


.10 


.09 


1.62 


1.70 


30.9 


1.58 


27.50 


2.46 


2.22 


.10 


.09 


.08 


.09 


1.73 


1.70 


40.2 


1.89 


35.80 


2.87 


2.27 


.12 


.09 


.07 


.09 


1.52 


1.68 


32.2 


1.72 


35.32 


1.92 


2.24 


.08 


.09 


.06 


.08 


1.69 


1.69 


46.3 


1.91 


40.19 


3.04 


2.30 


.13 


.10 


.13 


.09 


1.58 


1.68 


26.9 


1.56 


27.15 


2.14 


2.29 


.09 


.10 


.08 


.09 


1.70 


1.68 


47.9 


1.92 


41.07 


3.05 


2.34 


.13 


.10 


.06 


.09 


1.59 


1.67 


32.3 


1.65 


29.84 


2.12 


2.33 


.09 


.10 


.07 


.08 


1.59 


1.67 


40.9 


1.97 


37.76 


2.08 


2.32 


.09 


.10 


.05 


.08 


1.18 


1.64 


31.0 


1.65 


28.10 


1.79 


2.29 


.07 


.10 


.06 


.08 


1.28 


1.62 


49.7 


2.04 


45.36 


2.22 


2.29 


.09 


.10 


.04 


.08 


1.18 


1.60 


34.3 


1.66 


28.36 


1.69 


2.26 


.07 


.09 


.05 


.08 


1.11 


1.58 


47.7 


1.93 


42.46 


2.20 


2.26 


.09 


.09 


.05 


.08 


1.10 


1.56 


34.0 


1.61 


30.38 


1.69 


2.24 


.07 


.09 


.05 


.08 


1.13 


1.54 


47.2 


2.00 


45.68 


2.23 


2.24 


.09 


.09 


.05 


.08 


1.18 


1.53 


21.8 


1.64 


34.19 


1.67 


2.22 


.06 


.09 


.05 


.07 


1.12 


1.51 


51.2 


2.01 


51.90 


2.50 


2.23 


.09 


.09 


.04 


.07 


1.17 


1.50 
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0 









































BIORESEARCH INDEX VOLUME 71 
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Terms 


Cits. 


Hits 


Cost/Profile 


Cost/Term 


Issue 


Agg. 

Ratio 


Per 

Profile 


Total 


Ret. 


Per 

Profile 


Hit /Ret . 
Cit. 


Issue 


Avg., 


Issue 


Avg 


1 


13.30 


29.4 


7500 


1410 


33.5 


1.22 


9.43 


9.43 


.32 


.32 


2 


11.60 


25.7 


7500 


890 


24.7 


1.11 


11.83 


10.63 


.46 


.39 


3 


11.50 


25.6 


7500 


731 


20.3 


1.07 


9.29 


10.18 


.36 


.38 


4 


12.20 


24.2 


5833 


1241 


28.2 


1.11 


7.19 


9.44 


.30 


.36 


5 


12.40 


23.3 


7500 


1393 


29.6 


1.15 


7.68 


9.08 


.33 


.35 


6 


12.70 


22.9 


7500 


1542 


30.8 


1.13 


7.65 


8.85 


.33 


.35 


7 


12.20 


19.3 


7500 


4788 


77.2 


1.46 


8.44 


8.79 


.40 


.36 


8 


12.10 


18.8 


7500 


2625 


40.3 


1.26 


5.76 


8.41 


.31 


.35 


9 


10.30 


19.3 


7500 


2850 


46.7 


1.25 


6.56 


8.20 


.34 


.35 


10 


14.00 


18.8 


7500 


2611 


38.3 


1.23 


5.04 


7.89 


.28 


.34 


11 


11.80 


20.1 


7500 


2130 


40.1 


1.19 


7.87 


7.89 


.39 


.35 


12 


11.70 


20.4 


7500 


2501 


48.0 


1.24 


7.71 


7.87 


.38 


.35 



TABLE 10-20 



PROFILE TERM, HIT, COST DATA VS. ISSUE 





BIORESEARCH INDEX VOLUME 71 














> • 


Hits 




Cost/Profile 


Cost/Term 


Cost/Hit 


Cost/Term/Cit. 


Ret. 


Per Hit/Ret. 

Profile Cit. 


Issue 


Avg. 


Issue 


Avg. 


Issue 


Avg. 


X 

Issue 


10" 5 

Avg. 


1410 


33.5 


1.22 


9.43 


9.43 


.32 


.32 


.28 


.28 


4.28 


4.28 


890 


24.7 


1.11 


11.83 


10.63 


.46 


.39 


.48 


.38 


6.12 


5.20 


731 


20.3 


1.07 


9.29 


10.18 


.36 


.38 


.46 


.41 


4.83 


5.08 


1241 


28.2 


1.11 


7.19 


9.44 


.30 


.36 


.28 


.38 


3.96 


4.80 


1393 


29.6 


1.15 


7.68 


9.08 


.33 


.35 


.26 


.35 


4.38 


4.71 


1542 


30.8 


1.13 


7.65 


8.85 


.33 


.35 


.25 


.34 


4.45 


4.67 


4788 


77.2 


1.46 


8.44 


8.79 


.40 


.36 


.12 


.30 


5.34 


4.77 


2625 


40.3 


1.26 


5.76 


8.41 


.31 


.35 


.14 


.28 


4.08 


4.68 


2850 


46.7 


1.25 


6.56 


8.20 


.34 


.35 


.14 


.27 


4.52 


4.66 


2611 


38.3 


1.23 


5.04 


7.89 


.28 


.34 


.13 


.25 


6.89 


4.89 


2130 


40.1 


1.19 


7.87 


7.89 


.39 


.35 


.20 


.25 


5.20 


4.91 


2501 


48.0 


1.24 


7.71 


7.87 


.38 


.35 


.16 


.24 


5.03 


4.92 
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Terms 


Cits . 


Hits 


Cost/Profile 


Cost/T 


T Agg. Per 

ssue Ratio Profile 


Total Ret. 


Per Hit/Ret. 
Profile Cit. 


Issue Avg 


Issue i 











(Data 


for Issues 1-6 


do not exist.) 






7 


17.31 


38.59 


5832 


681 


30.95 


1.15 


6.38 


6.38 


.17 




8 


17.31 


38.59 


5833 


628 


28.55 


1.15 


5.93 


6.16 


.15 




9 


17.31 


38.59 


4088 


510 


28.18 


1.18 


4.52 


5.61 


.12 




10 


16.44 


41.00 


5833 


967 


42.04 


1.16 


10.27 


6.78 


.25 




11 


16.25 


34.79 


5833 


945 


32.59 


1.25 


7.73 


6.97 


.22 




12 


15.42 


33.15 


5835 


1023 


37.89 


1.22 


8.07 


7.15 


.24 




13 


15.42 


33.15 


5833 


845 


31.30 


1.19 


7.37 


7.18 


.22 




14 


18.06 


32.66 


5834 


836 


28.83 


1.19 


6.60 


7.11 


.20 




15 


18.31 


33.13 


5674 


986 


32.87 


1.26 


6.66 


7.06 


.20 




16 


18.29 


33.17 


5833 


990 


33.00 


1.27 


5.72 


6.93 


.17 




17 


16.13 


34.00 


5836 


1112 


35.87 


1.18 


5.92 


6.83 


.17 




18 


16.22 


33.81 


5837 


975 


31.45 


1.28 


5.78 


6.73 


.17 




19 


16.55 


31.77 


5836 


1133 


32.37 


1.19 


7.80 


6.82 


.25 


« 


20 


16.84 


31.23 


5838 


1053 


30.09 


1.22 


7.15 


6.85 


.23 


« 


21 


14.35 


32.11 


5836 


2585 


68.03 


1.37 


6.61 


6.83 


.21 


« 


22 


13.00 


31.46 


5836 


2192 


59.24 


1.28 


7.19 


6.86 


.23 


i 


23 


15.64 


29.80 


5834 


1371 


29.80 


1.26 


6.26 


6.82 


.21 


• 


24 


15.64 


29.41 


5833 


1357 


29.41 


1.25 


5.59 


6.75 


.19 


« 
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Table 10-21 

PROFILE TERM, HIT, COST DATA VS. ISSUE 
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s. 


Hits 


Cost/Profile 


Cost/Term 


Cost/Hit 


Cost/Term/Cit . 


Ret . 


Per Hit/Ret. 
Profile Cit. 


Issue 


Avg 


Issue 


Avg. 


Issue 


Avg. 


X 

Issue 


10- 5 

Avg. 


(Data for Issues 1-6 do not exist.) 














681 


30.95 


1.15 


6.38 


6.38 


.17 


.17 


.21 


.21 


2.83 


2.83 


628 


28.55 


1.15 


5.93 


6.16 


.15 


.16 


.21 


.21 


2.64 


2.74 


510 


28.18 


1.18 


4.52 


5.61 


.12 


.15 


.20 


.21 


2.87 


2.78 


967 


42.04 


1.16 


10.27 


6.78 


.25 


.17 


.24 


.22 


4.30 


3.16 


945 


32.59 


1.25 


7.73 


6.97 


.22 


.18 


.24 


.22 


3.81 


3.29 


1023 


37.89 


1.22 


8.07 


7.15 


.24 


.19 


.21 


.22 


4.17 


3.44 


845 


31.30 


1.19 


7.37 


7.18 


.22 


.20 


.24 


.22 


3.81 


3.49 


836 


28.83 


1.19 


6.60 


7.11 


.20 


.20 


.23 


.22 


3.47 


3.49 


986 


32.87 


1.26 


6.66 


7.06 


.20 


.20 


.20 


.22 


3.54 


3.49 


990 


33.00 


1.27 


5.72 


6.93 


.17 


.19 


.17 


.22 


2.96 


3.44 


1112 


35.87 


1.18 


5.92 


6.83 


.17 


.19 


.16 


.21 


2.98 


3.40 


975 


31.45 


1.28 


5.78 


6.73 


.17 


.19 


.18 


.21 


2.93 


3.36 


1133 


32.37 


1.19 


7.80 


6.82 


.25 


.20 


.24 


.21 


4.21 


3.42 


1053 


30.09 


1.22 


7.15 


6.85 


.23 


.20 


.24 


.21 


3.92 


3.46 


2585 


68.03 


1.37 


6.61 


6.83 


.21 


.20 


.10 


.21 


3.53 


3.46 


2192 


59.24 


1.28 


7.19 


6.86 


.23 


.20 


.12 


.20 


3.91 


3.49 


1371 


29.80 


1.26 


6.26 


6.82 


.21 


.20 


.21 


.20 


3.63 


3.50 


1357 


29.41 


1.25 


5.59 


6.75 


.19 


.20 


.19 


.20 


3.24 


3.49 
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Terms 


Cits. 


Hits 


Cost/Profile 


Cost/Term 


Issue 


Agg. 

Ratio 


Per 

Profile 


Total 


Ret. 


Per 

Profile 


Hit/Ret. 

Cit. 


Issue 


Avg. 


Issue 


1 


15.70 


29.6 


5833 


1675 


36.4 


1.28 


6.20 


6.20 


.21 


2 


15.10 


31.0 


5833 


1671 


37.1 


1.28 


7.54 


6.87 


.24 


3 


13.70 


29.4 


5834 


1413 


35.3 


1.23 


7.19 


6.98 


.24 


4 


13.60 


29.4 


5833 


1411 


34.4 


1.21 


7.80 


7.18 


.26 


5 


11.60 


25.7 


5833 


754 


20.9 


1.14 


7.30 


7.21 


.28 


6 


11.60 


25.7 


5834 


801 


22.2 


1.12 


7.34 


7.23 


.28 


7 


11.50 


25.6 


5833 


717 


19.9 


1.08 


7.64 


7.29 


.30 


8 


11.50 


25.6 


5833 


717 


19.9 


1.08 


7.55 


7.32 


.29 


9 


11.81 


25.1 


5833 


789 


19.5 


1.12 


7.03 


7.29 


.28 


10 


11.81 


25.1 


5833 


913 


22.9 


1.13 


7.76 


7.34 


.31 


11 


12.70 


22.9 


5834 


1693 


33.8 


1.21 


7.30 


7.33 


.32 


12 


15.10 


21.0 


5834 


3155 


46.3 


1.43 


5.93 


7.22 


.28 


13 


15.10 


21.0 


5833 


2937 


43.1 


1.42 


5.00 


7.04 


.24 


14 


15.10 


21.0 


5833 


2805 


41.2 


1.38 


5.71 


6.95 


.27 


15 


11.23 


18.6 


5833 


3552 


52.7 


1.48 


3.16 


6.70 


.17 


16 


11.50 


18.6 


5836 


2142 


33.4 


1.28 


5.81 


6.64 


.31 


17 


10.80 


18.9 


5832 


2184 


36.4 


1.27 


5.53 


6.58 


.29 


18 


10.80 


18.9 


5836 


2260 


37.6 


1.29 


6.05 


6.55 


.32 


19 


10.80 


18.9 


5836 


2310 


38.5 


1.29 


3.99 


6.41 


.21 


20 


10.80 


18.9 


5833 


2241 


37.3 


1.27 


5.98 


6.39 


.32 


21 


11.30 


19.4 


5839 


2463 


37.8 


1.34 


5.53 


6.35 


.29 


22 


11.60 


19.3 


5833 


2447 


38.8 


1.31 


5.77 


6.32 


.30 


23 


11.60 


19.3 


5833 


2168 


34.4 


1.30 


5.35 


6.28 


.28 


24 


11.70 


20.4 


5836 


1830 


35.1 


1.21 


6.57 


6.19 


.32 
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BIOLOGICAL ABSTRACTS PREVIEWS VOLUME 52 





its. 


Hits 


Cost/Profile 


Cost/Term 


Cost/Hit 


Cost/Term/Cit . 


Ret. 


Per 

Profile 


Hit /Ret. 
Cit. 


Issue 


Avg. 


Issue 


Avg. 


Issue 


Avg. 


x 10” 5 
Issue Avg. 


1675 


36.4 


1.28 


6.20 


6.20 


.21 


.21 


.17 


.17 


3.59 


3.59 


1671 


37.1 


1.28 


7.54 


6.87 


.24 


.23 


.20 


.19 


4.16 


3.88 


1413 


35.3 


1.23 


7.19 


6.98 


.24 


.23 


.20 


.19 


4.19 


3.98 


1411 


34.4 


1.21 


7.80 


7.18 


.26 


.24 


.23 


.20 


4.53 


4.12 


754 


20.9 


1.14 


7.30 


7.21 


.28 


.25 


.35 


.23 


4.86 


4.27 


801 


22.2 


1.12 


7.34 


7.23 


.28 


.25 


.33 


.25 


4.88 


4.37 


717 


19.9 


1.08 


7.64 


7.29 


.30 


.26 


.38 


.27 


5.10 


4.47 


717 


19.9 


1.08 


7.55 


7.32 


.29 


.26 


.38 


.28 


5.04 


4.54 


789 


19.5 


1.12 


7.03 


7.29 


.28 


.26 


.36 


.29 


4.78 


4.57 


913 


22.9 


1.13 


7.76 


7.34 


.31 


.27 


.34 


.29 


5.28 


4.64 


1693 


33.8 


1.21 


7.30 


7.33 


.32 


.27 


.22 


.29 


5.46 


4.72 


3155 


46.3 


1.43 


5.93 


7.22 


.28 


.27 


.22 


.28 


4.83 


4.72 


2937 


43.1 


1.42 


5.00 


7.04 


.24 


.27 


.12 


.27 


4.07 


4.67 


2805 


41.2 


1.38 


5.71 


6.95 


.27 


.27 


.14 


.26 


4.65 


4.67 


3552 


52.7 


1.48 


3.16 


6.70 


.17 


.27 


.06 


.25 


2.89 


4.55 


2142 


33.4 


1.28 


5.81 


6.64 


.31 


.27 


.17 


.24 


5.34 


4.60 


2184 


36.4 


1.27 


5.53 


6.58 


.29 


.27 


.15 


.24 


5.00 


4.63 


2260 


37.6 


1.29 


6.05 


6.55 


.32 


.27 


.16 


.23 


5.46 


4.67 


2310 


38.5 


1.29 


3.99 


6.41 


.21 


.27 


.10 


.23 


3.60 


4.62 


2241 


37.3 


1.27 


5.98 


6.39 


.32 


.27 


.16 


.22 


5.41 


4.66 


2463 


37.8 


1.34 


5.53 


6.35 


.29 


.27 


.15 


.22 


4.91 


4.67 


2447 


38.8 


1.31 


5.77 


6.32 


.30 


.27 


.15 


.22 


5.10 


4.69 


2168 


34.4 


1.30 


5.35 


6.28 


.28 


.27 


.16 


.21 


4.72 


4.69 


1830 


35.1 


1.21 


6.57 


6.19 


.32 


.28 


.19 


.21 


5.50 


4.72 
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
            Terms             Cits.             Hits           Cost/Profile   Cost/Term
        ---------------   -------------   -----------------   -------------   ---------
        Agg.      Per                     Per      Hit/Ret.
Issue   Ratio   Profile   Total    Ret.   Profile    Cit.     Issue    Avg.     Issue

  1     10.0     20.0     5743      270     20.7     1.03      4.12    4.12      .21
  2     11.2     19.2     5600      848     59.4     1.15      8.91    6.52      .46
  3     16.0     16.7     5743     1183     56.3     1.54      3.62    5.55      .22

        (Data for Issues 4-5 do not exist.)

  6     16.0     16.7     5743     1471     70.0     1.51      3.97    5.16      .24
  7     19.1     15.5     5743     1858     71.4     1.47      3.53    4.83      .23
  8     17.2     16.4     5743     1819     62.7     1.41      3.28    4.57      .20
  9     17.2     16.4     7710     2131     73.4     1.44      3.54    4.42      .22
 10     21.9     16.3     7116     3776    104.8     2.30      3.47    4.31      .21
 11     22.1     14.5     7157     2179     60.5     1.51      2.82    4.14      .19
 12     23.5     15.6     8320     2677     68.6     1.50      3.55    4.08      .15
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
                                                                                 Cost/Term/Cit.
Cits.          Hits           Cost/Profile     Cost/Term       Cost/Hit            x 10^-5
-----   -----------------   --------------   -------------   -------------   ---------------
        Per      Hit/Ret.
Ret.    Profile    Cit.     Issue     Avg.   Issue    Avg.   Issue    Avg.    Issue     Avg.

 270      20.7     1.03      4.12     4.12    .21     .21     .20     .20     3.82     3.82
 848      59.4     1.15      8.91     6.52    .46     .34     .15     .18     8.20     6.01
1183      56.3     1.54      3.62     5.55    .22     .30     .06     .14     3.77     5.26

        (Data for Issues 4-5 do not exist.)

1471      70.0     1.51      3.97     5.16    .24     .28     .06     .12     3.50     4.82
1858      71.4     1.47      3.53     4.83    .23     .27     .05     .10     3.30     3.52
1819      62.7     1.41      3.23     4.57    .20     .26     .05     .10     2.93     4.25
2131      73.4     1.44      3.54     4.42    .22     .25     .05     .09     2.80     4.05
3776     104.8     2.30      3.47     4.31    .21     .25     .03     .08     2.98     3.91
2179      60.5     1.51      2.82     4.14    .19     .24     .05     .08     2.70     3.77
2677      68.6     1.50      3.55     4.08    .15     .23     .07     .08     1.82     3.58
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
            Terms             Cits.             Hits           Cost/Profile
        ---------------   -------------   -----------------   -------------
        Agg.      Per                     Per      Hit/Ret.
Issue   Ratio   Profile   Total    Ret.   Profile    Cit.     Issue    Avg.

  1     25.3     20.3     6126     2127     33.2     1.44      2.35    2.35
  2     30.0     20.7     4385     2206     27.2     1.67      1.84    2.10
  3     27.1     20.0     5129     2719     36.2     1.49      1.95    2.05
  4     29.1     19.9     5823     3229     41.3     1.63      2.24    2.10
  5     29.0     20.2     5815     3665     48.8     1.57      2.80    2.24

        (Data for Issues 6-26 do not exist.)
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
                                                                                         Cost/Term/Cit.
    Cits.              Hits           Cost/Profile     Cost/Term       Cost/Hit            x 10^-5
-------------   -----------------   --------------   -------------   -------------   ---------------
                Per      Hit/Ret.
Total    Ret.   Profile    Cit.     Issue     Avg.   Issue    Avg.   Issue    Avg.    Issue     Avg.

6126     2127     33.2     1.44      2.35     2.35    .12     .12     .07     .07     1.88     1.88
4385     2206     27.2     1.67      1.84     2.10    .09     .11     .07     .07     2.02     1.95
5129     2719     36.2     1.49      1.95     2.05    .10     .10     .05     .06     1.90     1.93
5823     3229     41.3     1.63      2.24     2.10    .11     .11     .05     .06     1.93     1.93
5815     3665     48.8     1.57      2.80     2.24    .14     .11     .06     .06     1.86     1.92

(Data for Issues 6-26 do not exist.)
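
The derived cost columns in these tables can be approximately reproduced from the per-issue figures. The sketch below is an illustration only: the relationships (Cost/Term = Cost/Profile divided by terms per profile, Cost/Hit = Cost/Profile divided by hits per profile, Cost/Term/Cit. = Cost/Term divided by total citations, and "Avg." as a running mean over the issues processed so far) are inferred from the tabulated values and are not stated in the report. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IssueRow:
    """Per-issue figures as tabulated (field names are illustrative)."""
    issue: int
    terms_per_profile: float
    cits_total: int
    hits_per_profile: float
    cost_per_profile: float


def derived_costs(row):
    """Return (cost/term, cost/hit, cost/term/citation) for one issue.

    Assumed relationships, inferred from the tables rather than documented.
    """
    cost_per_term = row.cost_per_profile / row.terms_per_profile
    cost_per_hit = row.cost_per_profile / row.hits_per_profile
    cost_per_term_cit = cost_per_term / row.cits_total
    return cost_per_term, cost_per_hit, cost_per_term_cit


def running_average(values):
    """Cumulative mean after each issue, matching the 'Avg.' columns."""
    averages = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages


# Issue 1 of the last table above: 20.3 terms/profile, 6126 citations,
# 33.2 hits/profile, cost/profile 2.35.
row1 = IssueRow(1, 20.3, 6126, 33.2, 2.35)
term, hit, term_cit = derived_costs(row1)
print(round(term, 2), round(hit, 2), round(term_cit * 1e5, 2))

# Cost/Profile for issues 1-5 of the same table.
print([round(a, 2) for a in running_average([2.35, 1.84, 1.95, 2.24, 2.80])])
```

For issue 1 this gives values close to the tabulated .12, .07, and 1.88; small differences in the last digit are to be expected if the original figures were computed from unrounded costs.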
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Another check we made for several issues of CA was the num- 
ber of hits generated from each section of CA. This is not a 
regularly prepared data item but was done once to determine 
whether there were any sections that did not prove fruitful for 
our users. We thought we might be able to eliminate such 
sections from the search and thereby reduce cost--assuming the 
nature of the user group would not change in the area of the 
eliminated sections. We found that our users got hits from 
every section of CA. This is because CSC has a very 
heterogeneous group of users. 
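
A check of this kind amounts to tallying hits by section and looking for sections with none. The following is a minimal sketch under stated assumptions: each hit is represented only by the number of the CA section it came from, and the section numbers and sample data are hypothetical, not taken from CSC records.

```python
from collections import Counter


def hits_by_section(hit_sections, all_sections):
    """Count hits per section, including sections that produced none."""
    counts = Counter(hit_sections)
    return {section: counts.get(section, 0) for section in all_sections}


def unproductive_sections(tally):
    """Sections with zero hits: candidates for removal from the search."""
    return [section for section, count in tally.items() if count == 0]


# Hypothetical example: three sections, five hits.
tally = hits_by_section([1, 1, 3, 3, 3], all_sections=[1, 2, 3])
print(tally)
print(unproductive_sections(tally))
```

In the check described above, the second list would have been empty, since every section of CA yielded hits.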

To summarize, while we at IITRI are providing retrieval in 
a very practical, almost business-oriented, mode, we are 
not merely feeding profiles and data bases into a matching 
machine. We are doing considerable research regarding the 
entire operational system. More data on this are given in 
Section 11. 

In the absence of a totally controlled system, in which 
vocabularies, data base formats, and record contents are fixed, 
and of a static software system, compiler, hardware configuration, 
and operating system, there is no reasonable way to maintain an 
overview of the operation other than by collecting and 
interpreting such data to guide future efforts. 

10.4.7 Personnel 

The personnel tasks involved in design and operation of a 
center include: 

• management 

• system design 

• programming --develop and maintain software system, 
  including adapting to data base changes 

• profile coordinating and user liaison 

• keypunching --programs and profiles 

• clerical tasks --maintain records and distribute output 

• promotion and marketing 

• tape library maintenance 

CSC maintains weekly records of man-hours per function in 
order to monitor current expenditures, monitor staff performance, 
determine profile costs, and estimate future rates. 
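
As an illustration of how such records could feed a profile cost figure, the sketch below totals man-hours by function and divides the resulting labor cost by the number of profiles run. The record layout, the hourly rates, and the profile count are all hypothetical; the report does not describe CSC's actual costing procedure.

```python
from collections import defaultdict


def hours_by_function(records):
    """Sum man-hours per function from (function, hours) records."""
    totals = defaultdict(float)
    for function, hours in records:
        totals[function] += hours
    return dict(totals)


def labor_cost_per_profile(totals, hourly_rates, profiles_run):
    """Labor cost for the period divided by the number of profiles run."""
    labor = sum(hours * hourly_rates[function]
                for function, hours in totals.items())
    return labor / profiles_run


# Hypothetical week: hours, rates, and profile count are made up.
week = [("programming", 10.0), ("keypunching", 5.0), ("programming", 2.0)]
totals = hours_by_function(week)
rates = {"programming": 10.0, "keypunching": 4.0}
print(totals)
print(labor_cost_per_profile(totals, rates, profiles_run=70))
```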
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While data base leases and royalties, machine time, travel, 
purchased materials, and postage are significant budget items, 
the major expenditure in a center is personnel salaries. 

10.5 Marketing 

10.5.1 Mailings 

The Computer Search Center has used direct mail campaigns 
to acquaint large numbers of people with the CSC's services, as 
well as to inform selected groups of people about specific activ- 
ities. For example, approximately 5000 brochures sent to announce 
a CSC workshop on "Computer Retrieval of Scientific Information" 
serve not only to solicit the 20 or 30 workshop attendees but to 
help keep the CSC associated in people's minds with information 
retrieval. Mass mailings serve a publicity function rather than 
acting as a mechanism for directly soliciting SDI subscribers. The 
dates, number of items sent, recipients, and responses for CSC 
direct mailings are listed below. 



Date                 No. Items Sent/Responses   Recipients

July 1970            800/1                      Presidents of chemical companies
                                                with over 1000 employees

September 1970       approx. (2000 + 560)/95    IEEE subscribers;
                                                IEEE midwestern members

September 1970       135/22                     Members of ACS Div. of Chem.
                                                Lit., Chicago Sec.

November 1970        275/22                     Major U.S. universities

November 1970        approx. 5000/19            ASIS Members and previous
                                                CSC contacts

Spring 1971          approx. 2000/28            Directors of corporate research

March 1971           approx. 5000/32            ASIS Members and previous
                                                CSC contacts

November 1971        approx. 60/NA              IIT trustees

November 1971        approx. 5000/10            ASIS Members and previous
                                                CSC contacts

October 1971 -       approx. 1800/48            Selected Standard Industrial
February 1972                                   Classifications with over 1000
                                                employees in 13 midwestern states

February 1972        approx. 5000/22            ASIS Members and previous
                                                CSC contacts
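
The response figures above can be compared across mailings as percentage response rates. The short sketch below does this for the three mailings whose item counts are given exactly; the function name is illustrative, and mailings listed with approximate counts are left out.

```python
def response_rate(sent, responses):
    """Responses as a percentage of items sent."""
    return 100.0 * responses / sent


# (label, items sent, responses) for mailings with exact counts above.
mailings = [
    ("July 1970, chemical company presidents", 800, 1),
    ("September 1970, ACS Div. of Chem. Lit.", 135, 22),
    ("November 1970, major U.S. universities", 275, 22),
]

for label, sent, responses in mailings:
    print(f"{label}: {response_rate(sent, responses):.1f} percent")
```

The spread between the narrowly targeted mailings and the mailing to company presidents is consistent with the point made above that mass mailings serve mainly a publicity function.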






10.5.2 Press Releases 



Several announcements have been made to the press to 
publicize the Computer Search Center. In addition to news- 
papers and magazines that circulate to the general public, 
copies of the releases were sent to scientific and engineering 
journals in order to inform people involved with the communi- 
cation of scientific information about activities of the 
Computer Search Center. Dates and subjects of the releases 
are described below. 



Date             Subject

July 1970        Initiation of CSC subscriptions

November 1970    Workshop on "Computer Retrieval of
                 Chemical and Biological Information"

March 1971       Workshop on "Computer Retrieval of
                 Chemical and Biological Information"

Summer 1971      Advantages found by users of CSC
                 SDI service

November 1971    Workshop on "Computer Retrieval of
                 Scientific Information"

January 1972     Workshop on "Computer Retrieval of
                 Scientific Information"



10.5.3 Surveys 

10.5.3.1 IEEE REFLECS Survey 

A questionnaire was mailed to a sample of subscribers 
of journals published by the Institute of Electrical and 
Electronics Engineers and to a sample of IEEE members in the 
greater Chicago area. The questionnaire and descriptive 
literature about the REFLECS tape were prepared in collaboration 
with the Information Division of IEEE. 

A great deal of interest in the tape was shown by 
respondents. Of 89 respondents, nearly 80 percent were 
interested in a current awareness alerting program, although 
in most cases a financial commitment could not be made. 
Respondents replied anonymously unless they were interested in 
follow-up information, and 73 percent elected to provide names 
and addresses for further information. 






Although IEEE later decided not to produce the REFLECS 
tape, the information and insights obtained from the question- 
naire were used in developing and marketing services aimed 
at the engineering market. 

10.5.3.2 Food Technology Survey 

A telephone survey of 13 major food companies was con- 
ducted in five midwest states (Illinois, Minnesota, Wisconsin, 
Missouri, and Michigan) to assess the degree of interest in 
the International Food Information Service (IFIS) data base, 
Food Science and Technology Abstracts. Fifteen people in 13 
organizations responded. Of the 15, nine were favorable, five 
negative, and one undecided. Discussions are currently taking 
place with the Institute of Food Technologists regarding 
the establishment of IITRI as one of the two centers in the U.S. 
to handle IFIS tapes. 

10.5.3.3 Market Survey 

A market survey was made in 1970 to estimate the number 
of potential subscribers to the services of the Computer 
Search Center and to determine the interest in various data 
bases as a guide to Center expansion. The objective was to 
assess the potential user market in terms of size, location, 
experience, knowledge of data bases, preference for data bases, 
knowledge of computer information services, preference for 
information services and willingness and likelihood of paying 
for desired services. Because of the Center's existing 
services and current concentration in the chemical and biologi- 
cal fields, the survey concentrated primarily on the "Chemicals 
and Allied Products" industry. Universities, hospitals, and 
"Food and Kindred Products" industries were also surveyed. 

The survey was based upon statistical sampling. 

An analysis of the distribution of chemical process 
plants by region and state, and of manufacturing employment by 
industry, revealed that Illinois is representative not only of 
the East North Central Region but also of the U.S. as a whole. 
As approximately 70 percent of all industrial activity within 
the state of Illinois is located within the Chicago Standard 
Metropolitan Statistical Area, data collected within this area 
were considered to be representative of the state, the East 
North Central Region, and the United States. 

Data were collected by in-depth personal and telephone 
interviews based on a Field Interview Guide prepared by the 
Center staff and Philip D. Wittlinger, Jr., of Kalish, Wittlinger 
and Associates, who conducted the survey. A copy of the 
Field Interview Guide appears in the following pages as 
Figures 10-34 to 10-39. A member of the Center participated in the 
field interviews so that the survey and subscription effort were 
combined to elicit information and to offer services at the 
same time. The survey data were used in establishing rate 
structures. 

Although selection of organizations for interviewing 
had been planned on a random basis, two factors necessitated 
a change in the selection technique. (1) The American 
Petroleum Institute commenced marketing its SDI service using 
CA Condensates. As almost all petroleum and petrochemical 
companies are members of the API and have financial obligations 
and loyalty ties with the API, it was decided not to interview 
them during this program because their data inputs could 
bias our results. (2) Twenty-six organizations with fewer than 
100 employees that were contacted for the purpose of scheduling 
a personal interview indicated that they had no need for an 
SDI service. They either did not have an R & D activity 
or simply did not utilize literature search techniques within 
their operations. 



On the basis of telephone contacts, and upon analysis of 
the number of employees within the Computer Search Center's 
client companies, all of which were organizations of over 100
employees, it was decided that organizations with fewer than
100 employees offered virtually no potential and should be 
excluded from further study in the survey. Thus, organizations 
within the petroleum/petrochemical industry and those with 
fewer than 100 employees were eliminated from the survey. 
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ORGANIZATION WAS AWARE OF THE AVAILABILITY OF COMPUTERIZED INFORMATION SYSTEM(S) 

Figure 10-34 
Thirty organizations were surveyed and 70 individuals
were interviewed during the course of the study. All
organizations had technical libraries and in most cases they
were centralized. Most organizations employed either a library
staff or information scientists. Twenty-six percent of the
chemical-allied products companies and 40 percent of the
hospitals did not employ a special staff for literature
searching and dissemination. Most organizations were aware of
computer information services although only 16 percent of
chemical-allied product organizations, 50 percent of the
universities, and 40 percent of the hospitals were currently
using a current awareness alerting (SDI) service. Occasional
use of such services was reported by 16 percent of the
chemical-allied product organizations and 60 percent of the
hospitals. Ten percent of the former category had in-house
systems and another 21 percent were considering installation
of in-house systems.

Organizations that expressed little or no interest in
SDI services were uninterested for one or more of the following
reasons: (1) R & D efforts were in subject areas for which
there is no currently available data base; (2) R & D efforts
were highly or totally applications oriented; or (3) the
organization compounded or blended products based upon R & D
efforts of the supplier of the components of the products.

Abstracting journals were rated first by a majority of 
respondents in all categories. Technical serials were ranked 
second by a majority of respondents in all categories but the 
hospitals where technical books were ranked second as an in- 
formation source by 40 percent of the respondents. 

An evaluation of general characteristics of an SDI system
was made by respondents and their responses are summarized in
Table 10-25. Labor saving, coverage, and thoroughness were
considered to be essential characteristics of a system by a
plurality of respondents.

Table 10-25

EVALUATION OF SDI SYSTEM CHARACTERISTICS*
(PERCENT OF RESPONDENTS)

Respondents also evaluated specific characteristics of
IITRI's Computer Search System as these were described by
interviewers. In considering Table 10-26, it should be borne
in mind that the tabulated evaluations are based upon
anticipation and not working experience with the system.
Significant characteristics included proximity to the center,
no-cost profile change, low-cost profile switch, free form
Boolean logic, truncation, and content and format of output
cards. Multiple copy output and multilith output were rated
unimportant by a majority of respondents.
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(PERCENT OF RESPONDENTS) 



10.5.4 Pricing



Our current subscription fees are shown on the cost
sheets for CA, BA and EI (Figures 10-40, 10-41, and 10-42,
respectively). Initially, we had based our charges on a profile
rather than on the present system of input and output units.
However, since there were no restrictions on the size of an
individual profile, there was an imbalance between our cost
and the fees we charged. Profiles of two terms cost much
less to run than profiles of two hundred terms, yet the
subscription fees were the same. Compounding the problem was the
fact that some economy-minded users took advantage of the
free-form Boolean logic capability to ask several questions in
one huge profile. To combine three questions, they merely had to
put each separate question's logic expression within parentheses
and "OR" the three sets together. Weights could be used to
segment the output into three sets. An evaluation made after
several months of charging under the profile-based system
indicated that 10% of the users, paying 10% of the fees,
accounted for more than 40% of our costs.

After the first year of operation, we changed our fee
structure to one based on units of input (search terms) and
units of output (citations printed). This system more closely
reflects our actual costs. No limitations to profile size are
necessary and, if desired by the user, several questions can
be combined in one profile. However, the cost will reflect
the profile's size and number of citations retrieved. Since
our statistics showed that over 75% of the profiles could be
coded in 25 terms or less and would retrieve 50 or fewer
citations per issue searched, we set our basic input unit at 25
terms and our basic output unit at 50 citations retrieved per
issue searched. Supplemental units are based on each unit of
1-10 search terms for input, and 1-50 citations retrieved per
issue searched for output. Both input and output units are
averaged over the subscription period to even out minor
fluctuations. We also give discounts for several profiles
mailed to one address, reflecting our decreased handling costs
for those cases.
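
The unit-based fee described above can be illustrated with a short
sketch. This is not the Center's accounting program; it is a minimal
Python illustration that assumes the unit sizes stated in the text
(25-term basic input, 10-term supplemental input, 50-citation basic
and supplemental output) and, purely as an example, the BA-3 rates
from the BA price sheet ($250 basic, $100 per supplemental unit).
The function and parameter names are hypothetical.

```python
import math

BASIC_INPUT_TERMS = 25    # basic input unit: up to 25 search terms
SUPP_INPUT_TERMS = 10     # each supplemental input unit: 1-10 more terms
BASIC_OUTPUT_CITES = 50   # basic output unit: up to 50 citations per issue
SUPP_OUTPUT_CITES = 50    # each supplemental output unit: 1-50 more citations


def supplemental_units(average, basic, step):
    """Number of supplemental units needed beyond the basic unit."""
    if average <= basic:
        return 0
    return math.ceil((average - basic) / step)


def annual_fee(avg_terms, avg_citations, basic_fee, supp_input_fee, supp_output_fee):
    """Annual fee from terms and citations averaged over the subscription period."""
    extra_in = supplemental_units(avg_terms, BASIC_INPUT_TERMS, SUPP_INPUT_TERMS)
    extra_out = supplemental_units(avg_citations, BASIC_OUTPUT_CITES, SUPP_OUTPUT_CITES)
    return basic_fee + extra_in * supp_input_fee + extra_out * supp_output_fee


# Example with the BA-3 rates: a 31-term profile averaging 120 citations
# per issue needs one supplemental input unit and two supplemental
# output units.
fee = annual_fee(31, 120, basic_fee=250, supp_input_fee=100, supp_output_fee=100)
```

Because the averages are taken over the whole subscription period, an
occasional large issue does not by itself push a profile into a
higher unit.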

This subscription fee system is more equitable. Some 
users receive more service than others since they request 
changes more often, but we do not plan to charge for revisions. 
We think that such a charge might stifle legitimate reasons
for change and degrade profile performance. We have an
accounting program to keep track of search terms used and 
output generated for each profile, so the system is not 
cumbersome to operate. Although the rates may change as data 
base sizes increase and costs go up, we will probably retain 
this basic structure. 
COMPUTER SEARCH CENTER

at IIT Research Institute

CHEMICAL ABSTRACTS CONDENSATES (CAC)

Chemical Abstracts Service issues a CAC tape weekly. Each CAC tape
corresponds to the weekly printed issue of Chemical Abstracts.
Twenty-six weeks (issues) of CAC comprise one volume; two volumes
are published yearly. Odd numbered tapes cover sections 1-34
(organic); even numbered tapes cover sections 35-80 (inorganic).
CAC includes citations for each [illegible] in CA (about
[illegible] annually) which covers chemical literature throughout
the world.


SUBSCRIPTION STRUCTURE

BASIC INPUT
1-25 terms

Each Supplemental Input
1-10 more terms

BASIC OUTPUT
1-50 citations/tape

Each Supplemental Output
1-50 more citations (hits printed)/search

(all above outputs averaged over 12 mos.)


ANNUAL SUBSCRIPTION RATES

                                      BASIC UNIT     EACH SUPPLEMENTAL
CATEGORY   NUMBER ISSUES              COMBINATION    INPUT          OUTPUT

CA-1       26 (either even or odd)    [illegible]    [illegible]    [illegible]
CA-2       (both even and odd)        [illegible]    $100           $100


GROUP DISCOUNTS

Ten or more users within one organization (one mailing address)
may subscribe at the reduced rates of [illegible] and [illegible]
for CA-1 and CA-2, respectively. These rates are available
immediately when ten or more users enter subscriptions within a
30 day period. If ten or more users enter subscriptions over a
period longer than 30 days, their renewals will be at the
discounted rate.


HOW TO SUBSCRIBE

All subscriptions should be submitted on an organization's
purchase order with full prepayment.

Make checks payable to IIT RESEARCH INSTITUTE - CSC.

Mail to: Martha E. Williams
         Manager
         Computer Search Center
         10 West 35th Street
         Chicago, Illinois 60616


Figure 10-40
CA PRICE SHEET
COMPUTER SEARCH CENTER

at IIT Research Institute

BIOLOGICAL ABSTRACTS PREVIEWS (BA Previews)

BA (issued biweekly) covers biological journals throughout the
world and provides 140,000 citations annually. BioRI (issued
monthly) provides 100,000 citations annually and covers other
biological publications such as symposia proceedings, government
reports and conference papers.


SUBSCRIPTION STRUCTURE

BASIC INPUT
1-25 terms

Each Supplemental Input
1-10 more terms

BASIC OUTPUT
1-50 citations/tape

Each Supplemental Output
1-50 more citations (hits printed)/search

(all above outputs averaged over 12 mos.)


ANNUAL SUBSCRIPTION RATES

                              BASIC UNIT     EACH SUPPLEMENTAL
CATEGORY   NUMBER ISSUES      COMBINATION    INPUT      OUTPUT

BA-1       12 BioRI           $130           $45        $45
BA-2       24 BA              $200           $75        [illegible]
BA-3       36 Both            $250           $100       $100


GROUP DISCOUNTS

Ten or more users within one organization (one mailing address)
may subscribe at the reduced rates of $120, $170, and $220 for
BA-1, BA-2, and BA-3, respectively. These rates are available
immediately when ten or more users enter subscriptions within a
30 day period. If ten or more users enter subscriptions over a
period longer than 30 days, their renewals will be at the
discounted rate.


HOW TO SUBSCRIBE

All subscriptions should be submitted on an organization's
purchase order with full prepayment.

Make checks payable to IIT RESEARCH INSTITUTE - CSC.

Mail to: Martha E. Williams
         Manager
         Computer Search Center
         10 West 35th Street
         Chicago, Illinois 60616
COMPUTER SEARCH CENTER

at IIT Research Institute

COMPuterized ENgineering inDEX (COMPENDEX)

Engineering Index publishes monthly the COMPENDEX tape, a
compilation of key engineering journals throughout the world.
Over 3500 journals, conference proceedings, and other
publications are covered, providing over 84,000 citations
annually.
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INPUT OUTPUT 
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12 
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$75 $25 



GROUP DISCOUNTS

Ten or more users within one organization (one mailing address)
may subscribe at the reduced rate of $175 for EI-1. This rate is
available immediately when ten or more users enter subscriptions
within a 30 day period. If ten or more users enter subscriptions
over a period longer than 30 days, their renewals will be at the
discounted rate.


HOW TO SUBSCRIBE

All subscriptions should be submitted on an organization's
purchase order with full prepayment.

Make checks payable to IIT RESEARCH INSTITUTE - CSC.

Mail to: Martha E. Williams
         Manager
         Computer Search Center
         10 West 35th Street
         Chicago, Illinois 60616


Figure 10-42
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10.5.5 Brochures 

Of the many types of publicity used by the Computer Search 
Center, workshop and CSC brochures have probably received the 
widest circulation. Over 5,000 brochures announcing the latest 
workshop on Computer Retrieval of Scientific Information were 
sent to people who had had previous contact with the CSC or who 
were known to be interested in information science. The CSC 
brochure is used for all general publicity mailings, since it 
lists CSC services and gives examples of typical output. The 
workshop brochure is shown in Figures 9-1 and 9-2, and the CSC 
brochure is shown in Figures 10-43 and 10-44. 

10.5.6 Contacts 

Design, implementation, and development of the Computer 
Search Center have resulted in a great many contacts with in- 
formation scientists from other organizations, potential users, 
etc. Over the past four years, 1175 individuals in 719 distinct 
organizations have been in contact with Computer Search Center* 
personnel. These figures represent contacts made in person, via 
telephone calls or via individual correspondence. Individuals 
contacted as a result of a direct mailing are not included in 
the above numbers unless they responded by requesting further 
information. 
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10.6 Contacts and Cooperative Arrangements 

ASIDIC, the Association of Scientific Information Dissemin- 
ation Centers, was begun September 18-19, 1968. At that time, 
representatives of various centers providing services from 
machine-readable data bases developed by Chemical Abstracts
Service met at CAS to discuss their mutual goals and problems.
Members of IITRI's Computer Search Center were active at this
formative meeting. A series of workshops followed. They were 
held at IITRI (November 13, 1968), the University of Georgia (Aug- 
ust 26-28, 1968 and February 27-28, 1969), and the University of 
Pittsburgh (June 17-18, 1969) and dealt with programming, pro- 
file development and inter-center relationships. By mid-1969, 
the group had grown both in size and interests, as many indus- 
trial, university, and not-for-profit organizations were involved 
in processing a variety of data bases. 

On October 22-23, 1969, ASIDIC officially came into being
with the election of officers and development of a charter.
Eugene Schwartz of IITRI served as the first president of ASIDIC.
A pattern of two annual meetings developed. One, open to all,
is devoted to annual business and items of general interest. The
second retains the flavor of the earlier workshops and provides
an opportunity for small group round-table discussions of common
problems. The official purposes of ASIDIC are:

• to promote applied technology of information storage and 
retrieval as related to large data bases containing bib- 
liographic, textual and fact information 

• to share experience and information through meetings, 
seminars and workshops 

• to recommend standards for data elements, formats and 
codes 

• to promote research & development for more efficient use 
of varied data bases. 

Full membership is reserved for centers providing services to 
over 100 users from two or more data bases (not internal). 
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IITRI has maintained a continued interest in and service 
to ASIDIC. Martha Williams is the current Vice President, a 
member of the Committee on Center-Supplier Relations, and 
chairman of the Cooperative Data Management Committee, which 
recently compiled an extensive survey of centers and services. 
Peter Schipma has been an active member on the Standards Commit- 
tee since its inception. 

Over 20 data base suppliers and a similar number of centers, 
universities, industrial organizations, and government agencies 
have been contacted concerning possible data base use or in- 
formal networking. These discussions are continuing at the 
present time. Foreign countries with which contacts have been
made include:

Argentina          Denmark            Italy
Australia          England            Japan
Austria            Finland            Korea
Belgium            France             Mexico
Brazil             Germany            Netherlands
Canada             Hungary            Spain
Ceylon             India              Sweden
Chile              Ireland            Thailand
Czechoslovakia     Israel             Union of South Africa
11. RESEARCH STATISTICS, COMPUTATIONAL LINGUISTICS AND
    ANALYTICAL STUDIES

In order to provide good service to users and to gain in- 
sights that may lead toward future developments within or 
related to CSC, we maintain statistics on and conduct research 
related to various aspects of users, data bases, systems, and 
personnel. Statistics and records are maintained and research 
is conducted in an effort to: 

improve profiles 
monitor user response 
monitor data bases 

improve methods of using data bases 
suggest improvements for data bases 
observe trends 

devise cost accounting procedures 
monitor program efficiencies 
improve search strategies 
obtain data for future planning 
improve system 

monitor and project personnel needs 
generate data for further study. 

In addition to the data base statistics provided in Section 
6 and production statistics in Section 10, we maintain a variety 
of statistics on system features, profile terms, profiles, and 
hits (output) . 

11.1 System Features

The CSC system includes certain design features which were
employed following a study of the desirable and desired features
of systems we analyzed during the design phase of C6156. We
have since analyzed CSC profiles to determine the extent of use
of the design features: linking, truncation, variable term
types, free form Boolean logic, and weights.

11.1.1 Linking of Terms (See Section 4.2.3)

Links or groups are employed extensively by users of the
IITRI system. For example, in a typical run against an odd
numbered issue of CA Condensates, 94% of all the terms used in
all profiles were included in links. Only 6% of the terms were
referred to individually in the profile logic. While 6% of the
total number of terms in the run were not in links, only 5.6%
of the profiles (user questions) used no links at all. The
majority of the profiles, 73.6%, used one to four links, 20.8%
used five to ten links, and none used more than ten.

The number of terms in a single link has varied from one 
to 120. However, a more normal range is the range of one to 
38 observed in the run under discussion. The average number of 
terms per link is eight. 

11.1.2 Truncation and Various Data Types 
Truncation can be used with any kind of data element or 
term type in a given data base. An analysis of the use of the 
various truncation modes (none, left, right, and both left and 
right simultaneously) versus term type, indicates that, when 
searching an issue of CA, the search terms that users truncate are 
text terms (index and title terms), author terms, CODEN, CA section 
numbers, and corporate authors. As one might assume, subject 
terms or text terms comprise the majority of the terms in
profiles, followed by CA section number, CODEN, author and
corporate author. In fact, by term type, 93.2% of the terms were
subject or text terms, 2.0% were authors, 2.3% were CA section
numbers, 2.1% were CODEN, and 0.4% were corporate authors.

Table 11-1 gives the numbers of terms and various term 
types vs. truncation modes used in a particular run. 



NUMBER OF TERMS OF VARIOUS TERM TYPES
VS. TRUNCATION MODE USED

Naturally, right truncation is the most commonly used
truncation mode. As can be seen in Table 11-1, of the text terms,
54.8% are right truncated, 26.3% are not truncated at all, 16.3%
have simultaneous left and right truncation, and 2.6% are left
truncated. Note that the individual left and right truncation
modes do not include the instances of both left and right
truncation; hence if one wanted to know all instances of left
truncation, and not merely left and only left, one could add the
numbers from the "both" line to the numbers for left truncation
(and similarly to the numbers for right truncation). Thus,
using the numbers in Table 11-1 for text terms, all instances
of left truncation would be 18.9% (2.6 + 16.3) and all instances
of right truncation would be 71.1% (54.8 + 16.3). CA section
numbers and corporate authors are either right truncated or not
truncated at all. Left truncation would be of no meaningful
use. Right truncation on a CA section number would allow a
user to pick up 10 sections in biochemistry with the single
truncated term CA01*. CA01* will cover sections 10 through 19.
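
The bookkeeping described above (the "left" and "right" counts
exclude terms truncated on both sides, so the "both" count must be
added to get all instances) can be sketched as follows. This is an
illustrative Python sketch, not the INPUTR statistics code; it
assumes that truncation is written with an asterisk, as in CA01*,
and the sample terms are hypothetical.

```python
from collections import Counter


def truncation_mode(term):
    """Classify a search term as 'none', 'left', 'right', or 'both'."""
    left = term.startswith("*")
    right = term.endswith("*")
    if left and right:
        return "both"
    if left:
        return "left"
    if right:
        return "right"
    return "none"


def tally_modes(terms):
    """Count terms per truncation mode, plus all-left and all-right totals."""
    counts = Counter(truncation_mode(t) for t in terms)
    return {
        "none": counts["none"],
        "left": counts["left"],
        "right": counts["right"],
        "both": counts["both"],
        # "all" totals add the both-sides terms to each one-sided count
        "all_left": counts["left"] + counts["both"],
        "all_right": counts["right"] + counts["both"],
    }


# Hypothetical profile terms, for illustration only
sample = ["POLYMER*", "*OXIDE", "*CHLOR*", "BENZENE", "CATALY*"]
summary = tally_modes(sample)
```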

When truncation is used with author names it is usually 
right truncation and is helpful in picking up names that are 
spelled differently in a foreign language and transliterated 
in several ways. Left truncation on an author name will
retrieve variant representations of names such as O'Hara where
the spacing between the "O" and the "H" might vary and the
punctuation might be included in some cases and not others.

In the case of CODEN, truncation is little used but valuable
when needed. There is no need to truncate the CODEN for
a specific journal; in fact, to do so would produce false
retrievals. In the case of conferences and proceedings, which
are designated by a one or two in the first position of the
CODEN, right truncation can be used. Simultaneous left and
right truncation is used on patent CODEN. The third and fourth
positions in the CODEN for patents are designated XX, and one
can use the truncated search term *XX* to retrieve all patent
references.
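
The four truncation modes discussed in this section can be expressed
as a simple matching rule. The following Python sketch is an
illustration only, not the CSC search program: it assumes that an
asterisk marks the truncated side of a search term, and the data
base terms shown (including the three-digit section code CA015 and
the patent CODEN) are hypothetical examples.

```python
def matches(search_term, data_term):
    """Return True if data_term satisfies search_term under its truncation mode."""
    left = search_term.startswith("*")
    right = search_term.endswith("*")
    stem = search_term.strip("*")
    if left and right:
        return stem in data_term            # stem may occur anywhere
    if left:
        return data_term.endswith(stem)     # anything may precede the stem
    if right:
        return data_term.startswith(stem)   # anything may follow the stem
    return data_term == stem                # no truncation: exact match


section_hit = matches("CA01*", "CA015")     # right truncation on a section number
patent_hit = matches("*XX*", "USXXA3")      # both-sides truncation on a CODEN
author_hit = matches("*HARA", "O'HARA")     # left truncation on an author name
exact_miss = matches("BENZENE", "BENZENES")
```

Note that this simple rule treats *XX* as "XX anywhere in the
term"; it does not check that the XX falls in the third and fourth
positions.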

Table 11-2 shows the number of profiles, in a run, containing
various term types with the truncation modes used.
Table 11-3 shows the percent of profiles containing the various
term types versus the truncation mode used.

Truncation has been employed by all of the participants in
the CSC SDI program. Considering all the profile terms in
several runs:

No truncation was used for 46% of the terms
Left truncation was used for 5% of the terms
Right truncation was used for 36% of the terms
Both truncation was used for 13% of the terms



- Denotes truncation 
These statistics, initially generated on a computer-manual
basis, are now completely machine generated. (See Table 11-2).

STATISTICAL OUTPUT FROM INPUTR
(profiles for CA77:01)
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PERCENT OF PROFILES CONTAINING VARIOUS TERM TYPES 
VS. TRUNCATION MODE USED 






11.1.3 Free Form Boolean Logic 

During analysis of profiles run against CA Volume 76,
issues 25 and 26, we determined that 86.9% of the profiles used
AND logic, 77.6% used OR logic, and 32.4% used NOT logic.

Table 11-4 indicates the number of times each logic operator
was used within a profile. For example, 35 profiles or 13.1%
of the profiles did not use AND logic; 67 profiles or 25% used
the AND operator only once; and four profiles or 1.5% of the
profiles used AND ten or more times. The frequency of use
of OR logic is similar to that of AND. NOT logic, while used
in a larger percentage of profiles than one might suspect, is
not used very frequently within a single profile. It is used
in 32.4% of all profiles: once in 27.6% of the profiles,
twice in 3.2% and three times in 1.1% of the profiles. The
NOT operator is not used more than four times in any profile.

The CSC search system allows any number of parenthetic
logic statements in a profile and they can be nested to any
degree. Table 11-5 indicates the number of sets of parentheses
found in the same group of profiles. Sixty-five profiles or
24.3% used no parentheses, and 75.7% did use parentheses.
Thirty-nine profiles or 14.6% used one set of parentheses,
50 profiles or 18.7% used two sets, etc. This analysis shows
that, where permitted to use free logic, the user does make
use of that feature. The number of sets of parentheses is some
indication of the degree of complexity and length of the search
question. The actual use of nested logic is given in Table 11-6.
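
The three measures reported in Tables 11-4 through 11-6 (operator
counts, number of sets of parentheses, and highest degree of
nesting) can be derived from a profile's logic statement as in the
following Python sketch. It is an illustration under assumptions:
the logic is taken to be a text string in which terms or links are
referred to by labels and the operators are the words AND, OR, and
NOT. The actual CSC profile syntax is described elsewhere in the
report and may differ.

```python
import re


def count_operators(logic):
    """Count occurrences of AND, OR, and NOT in a logic statement."""
    tokens = re.findall(r"[A-Za-z0-9_]+", logic.upper())
    return {op: tokens.count(op) for op in ("AND", "OR", "NOT")}


def parenthesis_sets(logic):
    """Number of sets of parentheses (one per opening parenthesis)."""
    return logic.count("(")


def max_nesting(logic):
    """Highest degree of nesting; raises ValueError if unbalanced."""
    depth = deepest = 0
    for ch in logic:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return deepest


# Hypothetical logic statement using labels A through E
logic = "(A AND (B OR C)) AND NOT (D OR E)"
operators = count_operators(logic)
sets_used = parenthesis_sets(logic)
nesting = max_nesting(logic)
```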

11.1.4 Weighting

Weights were used in 24.14% of all profiles run against
CA Volume 76 (24.76% for the even numbered issues and 23.52%
for the odd numbered issues). This is an increase over the
11.6% use experienced in Volume 71. The increase is most
likely due to the change in our basis for pricing.
NUMBER AND PERCENT OF PROFILES USING AND, OR, AND NOT LOGIC
VS. NUMBER OF TIMES EACH OPERATOR WAS USED IN A PROFILE



CHEMICAL ABSTRACTS CONDENSATES VOLUME 76

Number of Sets of
Parentheses          Number of Profiles     % Profiles

 0                          65                 24.3
 1                          39                 14.6
 2                          50                 18.7
 3                          29                 10.8
 4                          24                  9.0
 5                          18                  6.7
 6                          10                  3.7
 7                           7                  2.6
 8                           3                  1.1
 9                           9                  3.3
10 (or more)                14                  5.2

Total                      268                100.0

Table 11-5

USE OF PARENTHETIC LOGIC IN PROFILES
CHEMICAL ABSTRACTS CONDENSATES VOLUME 76

Highest Degree of Nesting
of Parentheses              Number of Profiles     % Profiles

 0                                 65                 24.3
 1                                 89                 33.2
 2                                 70                 26.1
 3                                 29                 10.8
 4                                 13                  4.9
 5                                  2                  0.7

Total                             268                100.0

Table 11-6

USE OF NESTED LOGIC IN PROFILES



Initially there was no limit to the number of terms a user could
put in a profile. Later, when we found 10% of our users were
costing 40% of the machine time, we decided to assign a term
limit of 25. This encouraged users to try to use all of their
25 terms; hence a user with a one term profile would combine
his with those of one or two other users from the same company.
Because of the flexibility of the logic system they could specify
three profiles as one and separate the questions with OR logic
operators. Faced with the problem of combined output they would
then assign zero weight to one question and two distinct weights
(high and low weights) to the other questions. The net result
was that the zero weighted profile's output would be printed first,
the low weighted one's second, and the high weighted one's last.
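
The segmentation effect described above can be illustrated with a
short Python sketch. It assumes, for illustration only, that each hit
already carries the weight of the question that retrieved it and that
output is ordered from lowest to highest weight, as the text reports
for these combined profiles; how the CSC system actually computes and
sorts weights is not specified here. The citation identifiers and
weight values are hypothetical.

```python
def order_output(hits):
    """Order (citation_id, weight) pairs from lowest to highest weight.

    sorted() is stable, so hits with equal weight keep their input order.
    """
    return sorted(hits, key=lambda hit: hit[1])


def segment_by_weight(hits):
    """Group ordered hits into one output segment per distinct weight."""
    segments = {}
    for citation_id, weight in order_output(hits):
        segments.setdefault(weight, []).append(citation_id)
    return segments


# Three questions combined in one profile: weights 0, 5 (low), 10 (high)
combined_hits = [("c1", 10), ("c2", 0), ("c3", 5), ("c4", 0)]
segments = segment_by_weight(combined_hits)
```

Each key of `segments` corresponds to one of the original questions,
so the combined output can be separated and handed to the individual
users.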
11.2 Terms--Profiles

A retrieval system that involves natural language terms is
bound to be term oriented, i.e., the crux of the system
involves matching the intent of a user's question with the intent
of a titled-indexed reference, and the match takes place through
terms, either terms per se or terms that have been coded,
truncated, classified, etc. The terms of the profile and the terms
on the data base are of great importance. The profile terms
are designated by the CSC profiler and/or the user, and data
base terms by the supplier.

After checking the user aids in order to exercise what 
control we can on profile terms (term frequencies and term frac- 
tion occurrences) we prepare complete profiles incorporating 
appropriately truncated terms and logic, etc. 

Aggregation is the preparation of one sorted list containing
one occurrence only of each term from the total batch of
all profile terms in a run. The larger the profile term list
the greater the benefits of aggregation are, and conversely, if
a term list is reduced or split into two batches for separate
runs the benefits are diminished. A term that appears in several
profiles appears only once in the aggregated word list together
with information concerning the profiles in which the
term appears. The programming aspects of aggregation have been
discussed in Section 5 under the INPUTR program. Aggregation
serves several purposes. It effects a savings in search time
required: if a term is used in multiple profiles it need only be
searched once. An alphabetical profile term list is printed out
for all terms used in all profiles in a given run. This shows
spelling errors in profile input that should not but occasionally
do occur. It also shows variation in truncation which may be either
intentional or wasteful. One cannot automatically determine where
to truncate on a term, as the content of two or more profiles using
common term fractions may differ, resulting either in loss of
relevant information or in an overabundance of false hits. The
aggregation feature was included in the initial program design in
1968 and has proved to have economic benefit. In our first
production run we had only 800 profile terms before aggregation and
these were reduced by 15.7% to 674 terms actually submitted for
searching. When we reached 3758 profile terms, we achieved a
reduction of 29.5% to 2650 terms.
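
Aggregation as described above can be sketched in a few lines of
Python. This is an illustration, not the INPUTR program: the data
layout (a dictionary from profile identifier to its list of terms)
and the function names are assumptions, and the sample profiles are
hypothetical.

```python
def aggregate(profiles):
    """Build one sorted list of distinct terms, each with its profile ids.

    profiles: dict mapping a profile id to that profile's list of terms.
    Returns a list of (term, [profile ids]) pairs in alphabetical order.
    """
    index = {}
    for profile_id, terms in profiles.items():
        for term in terms:
            index.setdefault(term, set()).add(profile_id)
    return [(term, sorted(ids)) for term, ids in sorted(index.items())]


def reduction_percent(terms_before, terms_after):
    """Percent reduction in the number of terms submitted for searching."""
    return 100.0 * (terms_before - terms_after) / terms_before


# Hypothetical run of three small profiles
profiles = {
    "P1": ["POLYMER*", "CATALY*"],
    "P2": ["CATALY*", "OXID*"],
    "P3": ["POLYMER*"],
}
term_list = aggregate(profiles)
before = sum(len(terms) for terms in profiles.values())
reduction = reduction_percent(before, len(term_list))
```

Applied to the figures quoted above, `reduction_percent(3758, 2650)`
gives about 29.5 and `reduction_percent(800, 674)` gives 15.75,
which the report quotes as 15.7%.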

The aggregation ratio is dependent on the number and
character of profiles in a run. Homogeneity of profiles
increases the likelihood of identical terms being used in more
than one profile, and in a large number of profiles the number
of occurrences of specific terms is likely to be higher.
Aggregation is affected by use of standard truncations. (See
Section 7.4). Term aggregation for profiles run against issues
of CA, BA, and EI is shown in Figures 11-1 through 11-9. These
numbers expressed in terms of an aggregation reduction ratio
are presented in Figures 11-10 through 11-18. The average
number of terms per profile vs. issues of CA, BA, and EI is
given in Figures 11-19 through 11-27. The average number
decreased once the free pilot runs terminated and the
subscription fees were introduced. The average reached in Volume 72
was 34. We announced our prices and the averages started to
decrease. The current average is 24.

Cost per term vs. issue and cost per term per citation
vs. issue are given for CA, BA, and EI in Figures 11-28 through
11-45. The cost per profile for each of the issues searched is
given in Figures 11-46 through 11-54. The cost per profile for
searches of CA has steadily decreased from approximately
$11.00/issue to $1.75/issue for Volume 76. This decrease is due to
continued efforts to increase the efficiency of the software.
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11.3 Hits (Output) 

11.3.1 Hits--Profiles 

CSC statistics generation programs produce data regarding
the numbers of hits per profile per issue of each
data base: maximum, median and mean. The number of hits
affects the royalties we pay to data base suppliers and hence
our price structure. Some users cost us more in royalties
because they generate more hits. With a print limit of 50
for the base subscription fee the average user is not
constrained to try to cut down the number of hits to avoid
incurring added cost. The number of hits per profile per run
ranges from 0 to 359. The mean number of hits retrieved per
profile per issue is 25 and the median is 16. This is
dependent on data base size, hence a larger issue is likely to
produce more hits per profile. This is true with the exception of
maverick cases where a high frequency term is inadvertently
entered in an unrestricted manner, thus generating an inordinate
number of hits for one profile.
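
The per-issue hit statistics named above (maximum, median, and mean)
can be computed as in the following Python sketch, which uses the
standard `statistics` module. It is an illustration only; the input
is assumed to be a simple list of hit counts, one per profile, and
the sample values are hypothetical. A single "maverick" profile
pulls the mean well above the median, which is consistent with the
mean of 25 and median of 16 reported in the text.

```python
import statistics


def hit_summary(hits_per_profile):
    """Summarize hits per profile for one issue of a data base."""
    return {
        "maximum": max(hits_per_profile),
        "median": statistics.median(hits_per_profile),
        "mean": statistics.mean(hits_per_profile),
        "zero_hit_profiles": sum(1 for h in hits_per_profile if h == 0),
    }


# Hypothetical hit counts for five profiles in one run
issue_hits = [0, 4, 16, 20, 85]
issue_summary = hit_summary(issue_hits)
```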

The average number of hits per profile per issue for CA,
BA, and EI are given in Figures 11-55 through 11-63, and
normalized hits are presented in Figures 11-64 through 11-69.
They are normalized to the average number of citations per
issue for the volume in question.

While the mean number of hits per profile is 25 there 
are some profiles that get zero hits . Zero hit profiles can 
indicate several things: 

(1) inappropriate data base, 

(2) inappropriate issue of data base, 

(3) overly specific terms, 

(4) too tight logic, or 

(5) desired output. 
11.3.2 Hits--Retrieved Citations

The average number of hits per retrieved citation per 
issue tells how many of the profiles are retrieving the same 
citation and therefore indicates, to some extent, the homogen- 
eity of the total user group. 

Currently we are averaging 1.6 hits per retrieved citation
per issue. If all hits were printed (and 95% are), the
center would pay, on average, 1.6 times the royalty fee for
each retrieved citation. Approximately 60% of the citations
in the data base are found as hits for one or more profiles.
This is true of CA, BA, and EI. Naturally, the number of
profiles must reach approximately 100 for this to be true.
Beyond that point the percentage does not seem to increase,
though with an extremely large number of heterogeneous
profiles it would probably increase asymptotically to 99+%.
Unfortunately, we have not had the opportunity to check this.
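
The two measures discussed in this section can be sketched as follows. This is an illustration under stated assumptions, not the CSC implementation: the function name, the mapping of profile identifiers to sets of citation identifiers, and the sample data are all hypothetical.

```python
from collections import Counter


def citation_overlap(profile_hits, citations_in_issue):
    """Return (hits per retrieved citation, fraction of the issue retrieved).

    profile_hits maps a profile id to the set of citation ids it hit.
    """
    hit_counts = Counter()
    for cited in profile_hits.values():
        hit_counts.update(cited)  # one count per profile that hit the citation

    total_hits = sum(hit_counts.values())
    retrieved = len(hit_counts)  # distinct citations hit by at least one profile
    hits_per_retrieved = total_hits / retrieved if retrieved else 0.0
    return hits_per_retrieved, retrieved / citations_in_issue


# Hypothetical issue of 10 citations searched by three profiles.
hits = {
    "P1": {1, 2, 3},
    "P2": {2, 3, 4},
    "P3": {3, 5},
}
print(citation_overlap(hits, 10))
```

Here eight hits fall on five distinct citations, giving 1.6 hits per retrieved citation, and half of the issue is retrieved by at least one profile.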

The number of citations that are hits is probably a func- 
tion of the heterogeneity of the profile group. Or, it may be 
related to the fact that some citations have titles with no 
definitive terms and are also poorly indexed. 

Hits per retrieved citation per issue for CA, BA, and EI
are given in Figures 11-70 through 11-78. Normalized hits per
issue are given in Figures 11-79 through 11-84 and CSC machine
cost per hit per issue is given in Figures 11-85 through 11-93.

11.3.3 Printed Hits 

Not all citations that are hits for a profile are necessar- 
ily printed. A user may specify a print limit. Though most 
hits are printed some are not. The mean number of prints per 
profile per issue is 23.5 and the median is 15. 

The number of prints affects center cost somewhat, but
printing cost is minimal (2% of total run) as we print
off-line at significantly lower rates. We print approximately
150 K lines/week. While printing cost is low, postage for
shipping large numbers of cards at 1st class rates and the
purchase of the card stock are real costs; e.g.,

    card       8.0 mils
    postage    4.0 mils
    print      1.1 mils
              ---------
              13.1 mils or 31.2c per profile.
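
The arithmetic behind these figures can be checked with a short sketch. It assumes the itemized amounts are costs per printed card (the report does not state the unit) and uses the fact that one mil is a tenth of a cent. The constant and function names are hypothetical.

```python
from decimal import Decimal

# Itemized costs from the text, assumed here to be per printed card.
COST_PER_CARD_MILS = {
    "card": Decimal("8.0"),
    "postage": Decimal("4.0"),
    "print": Decimal("1.1"),
}


def cost_per_profile_cents(cards_per_profile):
    """Cost in cents of the cards mailed for one profile (10 mils = 1 cent)."""
    per_card_mils = sum(COST_PER_CARD_MILS.values())
    return per_card_mils * Decimal(str(cards_per_profile)) / Decimal(10)


print(sum(COST_PER_CARD_MILS.values()))  # total mils per card
print(cost_per_profile_cents(23.5))      # cents at the mean of 23.5 prints
```

At the mean of 23.5 prints per profile per issue this gives about 30.8 cents, close to the 31.2c quoted; the quoted figure corresponds to roughly 23.8 cards per profile, and the report does not say which print count it used.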

The number of hits and prints generated by each user 
organization is a statistic that is automatically generated. 

The corporate distribution of hits and prints is the same 
as profile hit distribution. It shows which companies are 
generating high numbers of hits (i.e., costing more) and it 
is tabulated by profile within company. This is an indicator 
for the center or user-company profile-coordinator as to which 
profiles are generating how many hits. 

CSC uses these data in estimating profile subscriber fees
for the next year; e.g., with a print limit of 50 for the
base fee and added cost thereafter, a user who gets a large
number of hits can predict the number of dollars he will need
for the next year.
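
A fee projection of this kind might look like the sketch below. The fee amounts are invented for illustration (no rates are given in this section), and applying the print limit separately to each issue is an assumption; the function and parameter names are hypothetical.

```python
def estimate_annual_fee_cents(prints_per_issue, base_fee_cents,
                              extra_print_fee_cents, print_limit=50):
    """Project a subscriber's yearly fee from past prints per issue.

    Prints up to print_limit in each issue are covered by the base fee;
    each print beyond the limit is charged extra_print_fee_cents.
    """
    extra_prints = sum(max(0, n - print_limit) for n in prints_per_issue)
    return base_fee_cents + extra_prints * extra_print_fee_cents


# Hypothetical history: prints received for four issues.
history = [40, 65, 50, 120]
# Hypothetical rates: $200.00 base fee, 10 cents per print over the limit.
print(estimate_annual_fee_cents(history, 20000, 10))
```

With these invented rates, the 85 prints over the limit add $8.50 to the base fee.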




[Figures 11-70 through 11-78: Hits per Retrieved Citation vs.
Issue. Charts not reproduced. Legible panels: Chemical Abstracts
Condensates Volume 75; Chemical Abstracts Condensates Volume 76
(Figure 11-74); Engineering Index COMPENDEX Volumes 71, 72
(Figure 11-78).]

[Figures 11-79 through 11-84: Normalized Hits vs. Issue. Charts
not reproduced. Legible panels: Chemical Abstracts Condensates
Volume 71 (Figure 11-79), Volume 72 (Figure 11-80), Volume 73,
Volume 74 (Figure 11-82), Volume 75 (Figure 11-83), and
Volume 76.]

[Figures 11-85 through 11-93: Cost per Hit vs. Issue. Charts
not reproduced. Legible panels: Chemical Abstracts Condensates
Volume 75 (Figure 11-88) and Volume 76 (Figure 11-89);
BioResearch Index Volumes 70, 71; Biological Abstracts Previews
Volume 51 and Volume 52; Engineering Index COMPENDEX
Volumes 71, 72 (Figure 11-93).]



12. CONFERENCES, PRESENTATIONS, PUBLICATIONS, AND PROFESSIONAL
ACTIVITIES



Computer Search Center personnel have participated extensively
in the professional concerns of the information community.
Activities in this field have proven to be a valuable source of
two-way communication between the Computer Search Center and
other information-processing organizations. For example, a
follow-up study of the November 1969 joint meeting of the
Chicago Sections of the ACS--Division of Chemical Literature,
SLA, and ASIS has shown that of the 54 organizations that
attended, 21 have become either trial users or subscribers of CSC.
Additional beneficial contacts have resulted from the activities 
described below. The items below are arranged by professional 
organization, and within organizations are listed offices held, 
followed by presentations and publications in chronological 
order. 

American Chemical Society 
Publications and talks: 

Fanta, P.E., Schwartz, E.S., and Williams, M.E., "Modern
Techniques in Chemical Information," presented at the Third Great
Lakes Regional Meeting of the American Chemical Society at
Northern Illinois University, DeKalb, Illinois, June 5, 1969.

Williams, M.E. and Schipma, P.B., "Design and Operation of a
Computer Search Center for Chemical Information," presented at
the American Chemical Society meeting in September 1969 and
published in the Journal of Chemical Documentation, Vol. 10,
No. 3, September 1970.

Williams, M.E., "Information Sciences at IIT Research Institute,"
seminar for a Joint Meeting of the Chicago Chapter of the
American Chemical Society, Special Libraries Association, and
the American Society for Information Science, November 1969.

"Searching the Scientific Literature by Computer," exhibit at 
the September 1970 ACS meeting. 
Williams, M.E., "Computer Based Information Retrieval for
Chemists," presented at the Rock River Valley Chapter of the 
American Chemical Society, Rockford, Illinois, March 25, 1971. 

Williams, M.E., "Linguistic Aids for Searching Data Bases," 
presented at the 3rd Central Regional Meeting of the ACS, 
Cincinnati, Ohio, June 8, 1971. 

Williams, M.E., "Computer Search Center," presented at the 5th
Great Lakes Regional Meeting of the ACS, Bradley University,
Peoria, Illinois, June 11, 1971.



American Society for Information Science
Offices held:

Williams, Martha E.   Councilor-at-Large, 1971-72
                      Publications Committee, 1971-72
                      Chairman, Committee on
                      Inter-Society Cooperation, 1972



Publications and talks: 

Schipma, P.B., Williams, M.E., and Shafton, A.L., "Comparison of
Document Data Bases," Journal of the American Society for
Information Science, Vol. 22, No. 5, September-October 1971.

Preece, S.E., "Data Base Support for an SDI System," presented
at the First Annual Mid-Year Regional Conference of the American
Society for Information Science, Dayton, Ohio, May 18-20, 1972.

Schipma, P.B., "PL/1 as an Information Retrieval Language,"
presented at the First Annual Mid-Year Regional Conference of
the American Society for Information Science, Dayton, Ohio,
May 18-20, 1972.

Stewart, A.K. and Williams, M.E., "International Information
Transfer and SDI," submitted as a contributed paper at the 1972
ASIS annual meeting, Washington, D.C.



Association for Computing Machinery
Offices held:

Williams, Martha E.   Publications Board, 1972-73

Publications and talks:

Onderisin, E.M., "The Least Common Bigram: A Dictionary
Arrangement Technique for Computerized Natural-Language Text
Searching," presented at the 1971 ACM National Conference,
August 3-5, 1971, and published in the Proceedings.
Association of Scientific Information Dissemination Centers
Offices held:

Schipma, Peter B.     Standards Committee, 1969-1972

Schwartz, Eugene S.   President, 1969-1970

Williams, Martha E.   Vice-President, 1971-1972, 1972-1973
                      Chairman, Cooperative Data Management
                      Committee, 1970-1972
                      Committee on Center Supplier Relations,
                      1971-1972



Publications and talks: 

Williams, M.E., "The Information Center of 1976," presented at
the Association of Scientific Information Dissemination
Centers meeting in Atlanta, Georgia, March 1970.

Schipma, P.B., "Term Fragment Analysis for File Inversion,"
presented at the NFSAIS/ASIDIC Joint Meeting, Washington, D.C.,
February 23, 1971.

Schwartz, E.S., "The Information Process: Relationships,
Problems and Limits," presented at the NFSAIS/ASIDIC Joint
Meeting, Washington, D.C., February 23, 1971.

Williams, M.E., "Cooperative Data Management for Information
Centers," presented at the NFSAIS/ASIDIC Joint Meeting,
Washington, D.C., February 24, 1971.

Williams, Martha E. and Stewart, Alan K., "ASIDIC Survey of
Information Center Services," June 1972.

National Academy of Sciences, National Research Council,
Committee on Chemical Information

Offices held:

Williams, Martha E.   Committee Member, 1970-1972
                      Chairman, Large Data Base
                      Subcommittee, 1971-1972

Publications and talks:

Presentation and discussion concerning the Computer Search
Center at the January 13-14, 1972 meeting held at Chicago, Illinois.

Large Data Base Survey, 1972.

National Federation of Science Abstracting and Indexing Services

Williams, M.E., "Computer Based Services," seminar presented at
the National Federation of Science Abstracting and Indexing
Services, New York, April 27-29, 1970 and Cleveland, Ohio,
May 25-27, 1970.

Schipma, P.B., "Technological Aspects of Computer Based
Services," presented at Seminar of the National Federation of
Science Indexing and Abstracting Services in Chicago,
May 10-11, 1971.






Williams, M.E., "Information Center--Case History," presented 
at the NFSAIS Computer Based Services Seminar, Chicago, Illinois, 
May 10-11, 1971. 

Williams, M.E., "Case History--IITRI, " presented at the NFSAIS 
Indexing in Perspective Seminar, Chicago, Illinois, May 24-26, 
1971. 

Schipma, P.B., "Technological Aspects of Computer Based Services,"
presented at Seminar of the National Federation of Science
Indexing and Abstracting Services in New York, February 3-4, 1972.



Miscellaneous 
Publications and talks:

"Computer Search Center," Science Information Notes, Vol. 1,
No. 3, May-June 1969, pp. 107-110.

Williams, M.E., "An Information Retrieval System," presented at 
the American Management Association seminar on Fundamentals of 
Information Retrieval Systems and Techniques, San Francisco, 
California, June 5-7, 1968. 

Williams, M.E., "The Information Problem," presented at the
Institute on Information Resources, Networks, and Retrieval,
Department of Engineering, University of Wisconsin, Madison,
Wisconsin, November 11-12, 1963.

Schwartz, E.S., "Heuristic Retrieval: Variable Search Strategies
for Identification," Journal of Chemical Documentation,
Vol. 9, No. 1, 1969, pp. 31-46.

Schwartz, E.S. and Williams, M.E. (IIT Research Institute) and
Fanta, P.E. (Illinois Institute of Technology), "Modern
Techniques in Chemical Information (Workbook and Syllabus),"
February 1969. To be published.

Williams, M.E., "Content Analysis of Documents: An Analytic
View," presented at the American Management Association seminar
on Fundamentals of Information Retrieval Systems, San Francisco,
California, June 21-25, 1969.



Williams, M.E., "Computer Search Center--A One Stop Information
Center for Chemical Librarians," presented at the Chemists'
Club Symposium, New York, April 9, 1970.

Williams, M.E., "Design of Data Base Systems and Identification
[remainder of title illegible]," presented at EDUCOM meeting,
Boston, Mass., April 15, 1970.

Williams, M.E., "SDI Whither?" presented at the annual Special
Libraries Association meeting in Detroit, Michigan, June 9, 1970.

Williams, M.E., "Provision of Information to the Research Staff,"
Paper No. [number illegible] presented at the American Institute
of Chemical Engineers, 63rd Annual Meeting, Chicago, Illinois,
December 3, 1970.
Williams, M.E., "New Techniques of Information Handling,"
Paper No. 14C presented at the American Institute of Chemical
Engineers, 63rd Annual Meeting, Chicago, Illinois,
December 3, 1970.

Williams, M.E., "Computer Searching of Multiple Machine-Readable
Data Bases," presented at the National Library Week Symposium II,
Information for the Seventies, Minneapolis, Minnesota,
April 20, 1971, and published in MnU Bulletin, Vol. 2, No. 3,
July 1971.



Williams, M.E., "Data Base Utilization--Information Center and
Related Applications," presented at the Colloquium on Machine-
Readable Data Bases their Creation and Use, sponsored by the
School of Library Science, State University of New York, Albany,
New York, April 21, 1971.

"Computerized Information Services for Chemists," Chemistry News,
issued by the Chemical Division of IIT Research Institute,
May 1971.

Williams, M.E., "Integration of a Processor-Supplied Data Base
with a Standard Center-Oriented System," presented at the
Chemical Abstracts Service--CA Integrated Subject File Users
Seminar, Columbus, Ohio, May 24, 1971.

Williams, M.E., "Use of Machine-Readable Data Bases by
Scientists and Engineers," presented at the ASEE annual meeting,
Annapolis, Maryland, June 24, 1971.



Williams, M.E., "Experiences of IIT Research Institute in Operating
a Computerized Retrieval System for Searching a Variety of
Data Bases," presented at the 3rd Cranfield International
Conference on Mechanized Information Storage and Retrieval
[date illegible], Cranfield Institute of Technology, Cranfield,
England, and published in Information Storage and Retrieval,
Vol. 8, No. 2, pp. 57-75, April 1972.

Williams, M.E., "Handling of Varied Data Bases in an Information
Center Environment," Proceedings of Conference on Computers in
Chemical Education and Research, Northern Illinois University,
DeKalb, Illinois, July 23, 1971.

Schipma, P.B., "IITRI's Computer Search Center [remainder of
entry partly illegible; it mentions multiple data bases],"
presented at the INTREX Seminar, MIT, October 23, 1971.



13. REFERENCES



A complete list of papers and presentations made by CSC 
staff members is given in Section 12. This section contains 
the papers referenced in earlier sections and a listing of data 
base documentation. 

13.1 Papers Referenced 

1. K. D. Carroll (Compiler & Editor): Survey of
   Scientific Technical Tape Services. AIPID 70-3. ASIS
   SIG/SDI, September 1970.

2. L. Cohan (Editor): Directory of Computerized
   Information in Science and Technology. Science
   Associates/International, Inc., New York.

3. M. E. Williams: Cooperative Data Management for
   Information Centers. Presented at the Association
   of Scientific Information Dissemination Centers
   Meeting, Washington, D. C., February 24, 1971.

4. P. B. Schipma, M. E. Williams and A. L. Shafton:
   Comparison of Document Data Bases. Journal of
   the American Society for Information Science,
   Vol. 22, No. 5, September-October 1971.

5. M. E. Williams: Handling of Varied Data Bases
   in an Information Center Environment. Presented
   at the Conference on Computers in Chemical Education
   and Research, Northern Illinois University,
   DeKalb, Illinois, July 23, 1971.

6. P. B. Schipma: Term Fragment Analysis for Inversion
   of Large Files. Presented at the Association
   of Scientific Information Dissemination Centers
   Meeting, Washington, D. C., February 24, 1971.
13.2 Data Base Documentation



American Institute of Physics 
New York, New York 

SPIN/O. A Magnetic Tape Service 
of the American Institute of Physics 

Bio-Sciences Information Service 
Philadelphia, Pennsylvania 

Guide to the Contents of BA Previews 

Chemical Abstracts Service 
Columbus, Ohio

Data Content Specifications 
for CA Condensates in S.D.F. 

Data Content Specifications 

for the CA Integrated Subject File in S.D.F. 

Data Content Specifications 
for Chemical Titles in S.D.F. 

Data Content Specifications 

for Chemical Industry Notes in S.D.F.

Data Content Specifications 
for Patent Concordance in S.D.F. 

Data Content Specifications 

for Polymer Science & Technology in S.D.F. 

Standard Distribution Format (S.D.F.) 

Technical Specifications (revised) 

Clearinghouse for Federal Scientific and Technical Information 
Springfield, Virginia

Clearinghouse Announcement Journal 
Available on Magnetic Tape 

ERIC Processing and Reference Facility 
Bethesda, Maryland

ERIC Master Files, Magnetic Tape Formats 
MARC II Format of the ERIC Data Base 

INSPEC, The Institute of Electrical Engineers 
London, England

Magnetic Tape Files Derived from the INSPEC Data Base

Institute for Scientific Information
Philadelphia, Pennsylvania 

ISI Magnetic Tapes 

International Food Information Service 
Frankfort am Main, Germany 

IFIS Magnetic Tape Manual 

Library of Congress 
Washington, D.C. 

Subscriber's Guide to the MARC Distribution Service 
14. CONCLUSION AND SUMMARY



The IITRI CSC was begun in July 1968. The first year
was spent in design and testing in preparation for providing
information services from machine-readable data bases to users
on a cost-recovery basis. Over the past three years CSC has
provided SDI and retrospective search services to a varied
and dispersed group of users in industry, academia, and
government. We designed the system to handle virtually any
document-type data base--and it does--and the data bases we
have used are BA, CA, and EI. We have processed approximately
600 profiles for 2500-3000 people, and we have searched more
than 2 million citations ranging from 200-800 characters each.
From this experience we have gathered statistical data,
analyzed the data, and conducted research. Our findings both
verify the design parameters and provide bases for monitoring
and improving the overall system--including the data bases,
software, profiles, users' reactions, and system operators, as
well as all of the interfaces between them. The work discussed
in this report does not relate to hypothetical cases, research
prototypes, or pilot studies. The report discusses what we
have done and are doing, plus observations regarding the
real-life situation of providing services on a production
basis to users who pay for the service.

At present we have completed four years' work under NSF
Contract 554, and a no-cost time extension has been granted
for continuing the contract through December 1972. Virtually
all of the design research and development work has been
completed, and the center is well on the way to becoming
self-supporting. The major problems affecting marketing are,
on the side of potential users, the lack of awareness and
understanding of machine-readable data bases and their
potential; and on the part of centers, the existence of duplicative
efforts and coverage. Through the auspices of ASIDIC, we
look forward to resource sharing and informal networking as
a means of improving the distribution of the available
products to an as yet limited but potentially sizeable market.
Machine-readable data bases are here to stay, and they fulfill
a real need, but efforts regarding repackaging of data and
development of new services from the data bases, together with
education of potential users, are needed.



